In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import time as time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
import matplotlib.cm as cm
from matplotlib.ticker import FormatStrFormatter
from itertools import product
import pandas as pd
from IPython.display import display, HTML

# feature analysis and selection
from sklearn.decomposition import PCA, KernelPCA
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction import DictVectorizer

# Preprocessing
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, OneHotEncoder, Imputer
from sklearn.model_selection import train_test_split, cross_val_score

# Processing
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
from sklearn.metrics import explained_variance_score, mean_absolute_error, mean_squared_error, r2_score
from sklearn import metrics

# Models (statsmodels and scikit-learn)
from statsmodels.regression.linear_model import OLS
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
#from sklearn.grid_search import GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

Data Fields

SOC, pH, Ca, P, and Sand are the five target variables to predict. The data have been monotonically transformed from the original measurements and thus include negative values.

  • PIDN: unique soil sample identifier

  • SOC: Soil organic carbon

  • pH: pH values
  • Ca: Mehlich-3 extractable Calcium
  • P: Mehlich-3 extractable Phosphorus

  • Sand: Sand content

  • m7497.96 - m599.76: 3,578 mid-infrared absorbance measurements. For example, the "m7497.96" column is the absorbance at wavenumber 7497.96 cm-1. We suggest removing the spectral CO2 bands, which lie in the region m2379.76 to m2352.76, but this is optional.

  • Depth: Depth of the soil sample (2 categories: "Topsoil", "Subsoil")
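
The suggested CO2 band removal can be sketched as a column filter on the header names. This is a minimal illustration in plain Python, not part of the original notebook: the helper name `co2_band_mask` and the sample header are hypothetical, and it assumes the spectral columns are named `m<wavenumber>` exactly as in the CSV header.

```python
def co2_band_mask(names, low=2352.76, high=2379.76):
    """Return one boolean per column: True if the column lies in the CO2 band."""
    mask = []
    for name in names:
        in_band = False
        if name.startswith('m'):
            try:
                in_band = low <= float(name[1:]) <= high
            except ValueError:
                # Not a spectral column after all (e.g. a name like 'mean')
                in_band = False
        mask.append(in_band)
    return mask

# Hypothetical header fragment, used only for illustration
header = ['m2381.69', 'm2379.76', 'm2366.26', 'm2352.76', 'm2350.83', 'BSAN']
mask = co2_band_mask(header)
kept = [name for name, drop in zip(header, mask) if not drop]
print(kept)
```

With a NumPy array whose columns line up with `header`, the inverted mask can be used to index the columns to keep. Note that `np.genfromtxt(names=True)` may sanitize header names, so it is safer to build the mask from the raw header line of the file.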

Some potential spatial predictors from remote sensing data sources are included. Short variable descriptions are provided below and additional descriptions can be found at AfSIS data. The data have been mean centered and scaled.

  • BSA: average long-term Black Sky Albedo measurements from MODIS satellite images (BSAN = near-infrared, BSAS = shortwave, BSAV = visible)
  • CTI: compound topographic index calculated from Shuttle Radar Topography Mission elevation data
  • ELEV: Shuttle Radar Topography Mission elevation data
  • EVI: average long-term Enhanced Vegetation Index from MODIS satellite images.
  • LST: average long-term Land Surface Temperatures from MODIS satellite images (LSTD = day time temperature, LSTN = night time temperature)
  • Ref: average long-term Reflectance measurements from MODIS satellite images (Ref1 = blue, Ref2 = red, Ref3 = near-infrared, Ref7 = mid-infrared)
  • Reli: topographic Relief calculated from Shuttle Radar Topography mission elevation data
  • TMAP & TMFI: average long-term Tropical Rainfall Measuring Mission data (TMAP = mean annual precipitation, TMFI = modified Fournier index)

In [2]:
# Load training data

X = np.genfromtxt('training.csv', 
                  delimiter=',', 
                  dtype=None,
                  skip_header = 1,
                  usecols=range(1, 3594)) # Load columns 1 to 3593 inclusive (range end is exclusive)

n = np.genfromtxt('training.csv', 
                  delimiter=',', 
                  max_rows = 1,
                  names = True,
                  usecols=range(1, 3594)) # Load the names of columns 1 to 3593 inclusive
feature_names = np.asarray(n.dtype.names)

Depth = np.genfromtxt('training.csv',
                  delimiter=',', 
                  dtype=None,
                  skip_header = 1,
                  usecols=3594) # Load Depth values

PIDN = np.genfromtxt('training.csv',
                    delimiter=',',
                    dtype=None,
                    skip_header = 1,
                    usecols=0) # Load the PIDN for reference

Ca = np.genfromtxt('training.csv', 
                   delimiter=',', 
                   dtype=None,
                   skip_header = 1,
                   usecols=3595) # Load Mehlich-3 extractable Calcium data

P = np.genfromtxt('training.csv', 
                   delimiter=',', 
                   dtype=None,
                   skip_header = 1,
                   usecols=3596) # Load Mehlich-3 extractable Phosphorus data

pH = np.genfromtxt('training.csv', 
                   delimiter=',', 
                   dtype=None,
                   skip_header = 1,
                   usecols=3597) # Load pH data

SOC = np.genfromtxt('training.csv', 
                    delimiter=',', 
                    dtype=None,
                    skip_header = 1,
                    usecols=3598) # Load Soil Organic Carbon data

Sand = np.genfromtxt('training.csv', 
                     delimiter=',', 
                     dtype=None,
                     skip_header = 1,
                     usecols=3599) # Load Sand Content data

# Outcome (or response) variable list
y_var_labels = ['Ca', 'P', 'pH', 'SOC', 'Sand']
y_vars = [Ca, P, pH, SOC, Sand]

# Color map for outcome variables
colors = ['orange', 'yellowgreen', 'powderblue', 'sienna', 'tan']
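
Each `np.genfromtxt` call above re-reads the full file. As a hedged alternative sketch (not the notebook's method), the file could be parsed once and split by column name. The miniature CSV below is made up purely to mirror the column layout (PIDN, predictors, Depth, then the five targets), and `split_columns` is a hypothetical helper.

```python
import csv
import io

TARGETS = ['Ca', 'P', 'pH', 'SOC', 'Sand']

def split_columns(handle):
    """Read the CSV once; return ids, predictor names, predictors, depth, targets."""
    reader = csv.reader(handle)
    header = next(reader)
    skip = set(['PIDN', 'Depth'] + TARGETS)
    pred_idx = [i for i, name in enumerate(header) if name not in skip]
    id_col = header.index('PIDN')
    depth_col = header.index('Depth')
    target_cols = dict((name, header.index(name)) for name in TARGETS)

    ids, predictors, depth = [], [], []
    targets = dict((name, []) for name in TARGETS)
    for row in reader:
        ids.append(row[id_col])
        depth.append(row[depth_col])
        predictors.append([float(row[i]) for i in pred_idx])
        for name in TARGETS:
            targets[name].append(float(row[target_cols[name]]))
    names = [header[i] for i in pred_idx]
    return ids, names, predictors, depth, targets

# Made-up two-row sample with the same column order as training.csv
sample = io.StringIO(
    u"PIDN,m7497.96,m7496.04,BSAN,Depth,Ca,P,pH,SOC,Sand\n"
    u"a1,0.30,0.31,-0.5,Topsoil,-0.2,0.1,0.4,-0.1,1.2\n"
    u"a2,0.28,0.29,0.7,Subsoil,0.5,-0.3,-0.9,0.6,-0.4\n"
)
ids, names, predictors, depth, targets = split_columns(sample)
print(names)
print(targets['Sand'])
```

Selecting columns by name rather than by position also avoids the off-by-one risk of hard-coded indices such as 3594 to 3599.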

In [3]:
# Load test data

test_x = np.genfromtxt('sorted_test.csv', 
                                delimiter=',', 
                                dtype=None,
                                skip_header = 1,
                                usecols=range(1, 3594)) # Load columns 1 to 3593 inclusive

test_depth = np.genfromtxt('sorted_test.csv',
                  delimiter=',', 
                  dtype=None,
                  skip_header = 1,
                  usecols=3594) # Load Depth values

test_ids = np.genfromtxt('sorted_test.csv', 
                                delimiter=',', 
                                dtype=None,
                                skip_header = 1,
                                usecols=0) # Load the PIDN for reference

In [4]:
# Transform depth and concatenate to X and test_x for use

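le = LabelEncoder()
# Fit the encoder on the training labels only, then reuse it for the test labels
# so that both sets share the same category-to-integer mapping.
depth_enc = le.fit_transform(Depth).astype(np.float64)
test_depth_enc = le.transform(test_depth).astype(np.float64)

# Append the encoded depth as one extra column
X_wDepth = np.concatenate((X, depth_enc.reshape(-1, 1)), axis=1)
test_x_wdepth = np.concatenate((test_x, test_depth_enc.reshape(-1, 1)), axis=1)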
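
To illustrate why a label encoder should be fitted once on the training labels and then reused, here is a small plain-Python sketch of the kind of mapping `LabelEncoder` builds (sorted unique labels mapped to 0, 1, ...). The helper names and sample labels are hypothetical.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    return dict((label, code) for code, label in enumerate(sorted(set(labels))))

def apply_label_mapping(mapping, labels):
    """Encode labels with an existing mapping; unseen labels raise KeyError."""
    return [float(mapping[label]) for label in labels]

train_depth = ['Topsoil', 'Subsoil', 'Topsoil', 'Subsoil']
test_depth = ['Topsoil', 'Topsoil']  # happens to contain only one category

mapping = fit_label_mapping(train_depth)
reused = apply_label_mapping(mapping, test_depth)
refit = apply_label_mapping(fit_label_mapping(test_depth), test_depth)

print(reused)  # codes agree with the training encoding
print(refit)   # codes shift because the mapping was rebuilt from the test labels
```

When both categories appear in both files, the two approaches give the same codes, but reusing the training mapping is the safer habit.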

In [5]:
# Inspect the data shapes

print "Training data shape: ", X.shape
print "Feature name shape: ", feature_names.shape
print "PIDN data shape: ", PIDN.shape
print "Depth data shape: ", Depth.shape
print "Ca data shape: ", Ca.shape
print "P data shape: ", P.shape
print "pH data shape: ", pH.shape
print "SOC data shape: ", SOC.shape
print "Sand data shape: ", Sand.shape
print "Test data shape: ", test_x.shape
print "Test_ids shape: ", test_ids.shape


Training data shape:  (1157, 3593)
Feature name shape:  (3593,)
PIDN data shape:  (1157,)
Depth data shape:  (1157,)
Ca data shape:  (1157,)
P data shape:  (1157,)
pH data shape:  (1157,)
SOC data shape:  (1157,)
Sand data shape:  (1157,)
Test data shape:  (727, 3593)
Test_ids shape:  (727,)

In [6]:
# Inspect the data in the five response variables

print "Ca: total = %d, max = %0.2f, mean = %0.2f, min = %0.2f" % (Ca.shape[0], np.max(Ca), np.mean(Ca), np.min(Ca))
print "P: total = %d, max = %0.2f, mean = %0.2f, min = %0.2f" % (P.shape[0], np.max(P), np.mean(P), np.min(P))
print "pH: total = %d, max = %0.2f, mean = %0.2f, min = %0.2f" % (pH.shape[0], np.max(pH), np.mean(pH), np.min(pH))
print "SOC: total = %d, max = %0.2f, mean = %0.2f, min = %0.2f" % (SOC.shape[0], np.max(SOC), np.mean(SOC), np.min(SOC))
print "Sand: total = %d, max = %0.2f, mean = %0.2f, min = %0.2f" % (Sand.shape[0], np.max(Sand), 
                                                                  np.mean(Sand), np.min(Sand))

def plot_hist(ind, data, max_y, title, color):

    counts, bins, patches = ax[ind].hist(data, facecolor=color, edgecolor='gray')
    # set the ticks to be at the edges of the bins.
    ax[ind].set_xticks(bins)
    # set the limits for x and y
    ax[ind].set_xlim([np.min(data),np.max(data)])
    ax[ind].set_ylim([0,max_y])
    # set the xaxis's tick labels to be formatted with 1 decimal place
    ax[ind].xaxis.set_major_formatter(FormatStrFormatter('%0.1f'))
    ax[ind].set_title(title, fontsize=18)
  
    # Label the raw counts and the percentages below the x-axis
    bin_centers = 0.5 * np.diff(bins) + bins[:-1]
    for count, x in zip(counts, bin_centers):
        # Label the raw counts
        ax[ind].annotate(str(count), xy=(x, 0), xycoords=('data', 'axes fraction'),
            xytext=(0, -18), textcoords='offset points', va='top', ha='center')

        # Label the percentages
        percent = '%0.1f%%' % (100 * float(count) / counts.sum())
        ax[ind].annotate(percent, xy=(x, 0), xycoords=('data', 'axes fraction'),
            xytext=(0, -32), textcoords='offset points', va='top', ha='center')


fig, ax = plt.subplots(3, 2, figsize=(15, 20))
fig.subplots_adjust(hspace = 0.5, wspace=.2)
ax = ax.ravel()

# Ca
plot_hist(0, Ca, Ca.shape[0], 'Ca Value Histogram', colors[0])
# P
plot_hist(1, P, P.shape[0], 'P Value Histogram', colors[1])
#pH
plot_hist(2, pH, pH.shape[0], 'pH Value Histogram', colors[2])
#SOC
plot_hist(3, SOC, SOC.shape[0], 'SOC Value Histogram', colors[3])
#Sand
plot_hist(4, Sand, Sand.shape[0], 'Sand Value Histogram', colors[4])
# delete the last subplot
fig.delaxes(ax[5])


Ca: total = 1157, max = 9.65, mean = 0.01, min = -0.54
P: total = 1157, max = 13.27, mean = -0.01, min = -0.42
pH: total = 1157, max = 3.42, mean = -0.03, min = -1.89
SOC: total = 1157, max = 7.62, mean = 0.08, min = -0.86
Sand: total = 1157, max = 2.25, mean = -0.01, min = -1.49

In [7]:
# Inspect the data in the predictor variables

def plot_data(ind, data_x, data_y, aspect, title, color):
    
    ax[ind].set_title(title, fontsize=18)
    ax[ind].set_xlabel('Predictor Values', fontsize=14)
    ax[ind].set_ylabel('Response Values', fontsize=12)
    ax[ind].set_aspect(aspect = aspect, adjustable='box')
    ax[ind].grid(True)
    ax[ind].scatter(data_x, data_y, color = color, alpha = 0.2, marker = 'o', edgecolors = 'black')

# set up the grid plot
fig, ax = plt.subplots(2, 3, figsize=(15, 20))
#fig.subplots_adjust(hspace = 0.5, wspace=.2)
ax = ax.ravel()

# select the predictor range (note this is influenced by the PCA below)
my_col = 20

X_sub = np.ravel(X[:,:my_col].reshape(-1,1))

# Ca 
plot_data(0, X_sub, np.repeat(Ca, my_col), 0.1, 'Ca vs. %d Predictors' % my_col, colors[0])
# P 
plot_data(1, X_sub, np.repeat(P, my_col), 0.1, 'P vs. %d Predictors' % my_col, colors[1])
#pH 
plot_data(2, X_sub, np.repeat(pH, my_col), 0.2, 'pH vs. %d Predictors' % my_col, colors[2])
#SOC 
plot_data(3, X_sub, np.repeat(SOC, my_col), 0.1, 'SOC vs. %d Predictors' % my_col, colors[3])
#Sand 
plot_data(4, X_sub, np.repeat(Sand, my_col), 0.35, 'Sand vs. %d Predictors' % my_col, colors[4])
# delete the last subplot
fig.delaxes(ax[5])


Which features have more impact?

There are over three thousand features in this data set (3,593) but only 1,157 rows, so we have many more features (k) than samples (n). Perhaps there is a subset of features to focus on.

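Below, we investigate two variations of PCA (linear and RBF kernel) to see how much of the variance a reduced set of components explains. The first 20 components account for steadily increasing portions of the variance, reaching 99.50% for linear PCA and 98.69% for kernel PCA; beyond 20 components, each additional one adds very little. The first 70-80 components explain ~100% of the variance (linear PCA reaches 100.00% at 75 components, to two decimals; kernel PCA reaches 99.90%).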
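
The question of how many components are enough can be read directly off the cumulative ratios. A minimal sketch, using the first ten linear-PCA values from the output table of the PCA cell; `components_needed` is a hypothetical helper, not part of the notebook.

```python
from bisect import bisect_left

def components_needed(cumulative_ratio, threshold):
    """Smallest k whose cumulative explained-variance ratio reaches threshold.

    Returns None if the threshold is never reached within the given values.
    """
    k = bisect_left(cumulative_ratio, threshold)
    return k + 1 if k < len(cumulative_ratio) else None

# Cumulative ratios for the first ten linear-PCA components (from the printed table)
pca_lin_first10 = [0.7054, 0.7938, 0.8554, 0.8931, 0.9169,
                   0.9363, 0.9514, 0.9616, 0.9675, 0.9727]

print(components_needed(pca_lin_first10, 0.90))
print(components_needed(pca_lin_first10, 0.95))
print(components_needed(pca_lin_first10, 0.99))
```

On the full NumPy array, `np.searchsorted(pca_lin_cumsum, threshold) + 1` gives the same count, since the cumulative sum is non-decreasing.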


In [8]:
# Linear PCA using all of the features
n_comp = min(X.shape)  # PCA can yield at most min(n_samples, n_features) components
pca_lin = PCA(n_components = n_comp)
pca_lin.fit(X)
pca_lin_cumsum = np.cumsum(pca_lin.explained_variance_ratio_)

# Non-linear kernel RBF PCA using all of the features 
pca_kern = KernelPCA(n_components = n_comp, kernel = 'rbf')
pca_kern.fit(X)

# build the explained variance ratio for pca_kern from its kernel-matrix eigenvalues
explained_var_ratio_kern = pca_kern.lambdas_ / pca_kern.lambdas_.sum()
pca_kern_cumsum = np.cumsum(explained_var_ratio_kern)

# Plot the Information Gain graph
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(pca_lin_cumsum, color = 'purple', marker = 'o', ms = 5, mfc = 'red', label = 'pca_lin')
ax.plot(pca_kern_cumsum, color = 'purple', marker = 'o', ms = 5, mfc = 'yellow', label = 'pca_kern')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.91), shadow=False, scatterpoints=1)
fig.suptitle('Cumulative Information Gain', fontsize=18)
plt.xlabel('Number of Components', fontsize=14)
plt.ylabel('Cumulative Variance Ratio', fontsize=12)
plt.grid(True)
ax.set_xlim([0,30])
ax.set_ylim([0.5,1.0])

# Output variance fractions
print '\n-------------------------------------------'
print 'Fraction of the total variance in the training explained by first k components: \n'
for k in range(1,76): 
    print("%d \t %s \t %s" % (k, '{0:.2f}%'.format(pca_lin_cumsum[k-1] * 100), 
                                    '{0:.2f}%'.format(pca_kern_cumsum[k-1] * 100)))


-------------------------------------------
Fraction of the total variance in the training explained by first k components: 

1 	 70.54% 	 66.91%
2 	 79.38% 	 75.72%
3 	 85.54% 	 82.10%
4 	 89.31% 	 85.88%
5 	 91.69% 	 88.83%
6 	 93.63% 	 90.85%
7 	 95.14% 	 92.57%
8 	 96.16% 	 94.10%
9 	 96.75% 	 95.11%
10 	 97.27% 	 95.74%
11 	 97.74% 	 96.29%
12 	 98.10% 	 96.76%
13 	 98.42% 	 97.14%
14 	 98.69% 	 97.49%
15 	 98.90% 	 97.77%
16 	 99.06% 	 98.00%
17 	 99.20% 	 98.21%
18 	 99.34% 	 98.40%
19 	 99.44% 	 98.55%
20 	 99.50% 	 98.69%
21 	 99.56% 	 98.80%
22 	 99.62% 	 98.90%
23 	 99.67% 	 98.99%
24 	 99.72% 	 99.07%
25 	 99.75% 	 99.14%
26 	 99.78% 	 99.20%
27 	 99.81% 	 99.26%
28 	 99.83% 	 99.31%
29 	 99.85% 	 99.36%
30 	 99.86% 	 99.40%
31 	 99.88% 	 99.43%
32 	 99.89% 	 99.46%
33 	 99.90% 	 99.49%
34 	 99.91% 	 99.52%
35 	 99.92% 	 99.54%
36 	 99.93% 	 99.57%
37 	 99.93% 	 99.59%
38 	 99.94% 	 99.61%
39 	 99.94% 	 99.63%
40 	 99.95% 	 99.65%
41 	 99.95% 	 99.67%
42 	 99.96% 	 99.69%
43 	 99.96% 	 99.70%
44 	 99.96% 	 99.71%
45 	 99.96% 	 99.73%
46 	 99.97% 	 99.74%
47 	 99.97% 	 99.75%
48 	 99.97% 	 99.76%
49 	 99.97% 	 99.77%
50 	 99.98% 	 99.78%
51 	 99.98% 	 99.79%
52 	 99.98% 	 99.80%
53 	 99.98% 	 99.80%
54 	 99.98% 	 99.81%
55 	 99.98% 	 99.82%
56 	 99.98% 	 99.83%
57 	 99.99% 	 99.83%
58 	 99.99% 	 99.84%
59 	 99.99% 	 99.84%
60 	 99.99% 	 99.85%
61 	 99.99% 	 99.85%
62 	 99.99% 	 99.86%
63 	 99.99% 	 99.86%
64 	 99.99% 	 99.87%
65 	 99.99% 	 99.87%
66 	 99.99% 	 99.88%
67 	 99.99% 	 99.88%
68 	 99.99% 	 99.88%
69 	 99.99% 	 99.89%
70 	 99.99% 	 99.89%
71 	 99.99% 	 99.89%
72 	 99.99% 	 99.90%
73 	 99.99% 	 99.90%
74 	 99.99% 	 99.90%
75 	 100.00% 	 99.90%

In [9]:
# Linear Regression with PCA combinations

y_pipelines_lin = []
y_scores_lin = []

start = time.time()
for ind, y in enumerate(y_vars):

    print '\n----------', y_var_labels[ind]

    # set up the train and test data
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    pca = PCA()
    linear = LinearRegression()
    steps = [('pca', pca), ('linear', linear)]
    pipeline = Pipeline(steps)

    parameters = dict(pca__n_components=list(range(20, 90, 10)))

    cv = GridSearchCV(pipeline, param_grid=parameters, verbose=0)
    cv.fit(X_train, y_train)   

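    # Note: this clones the grid search and re-runs it with cross-validation on the
    # held-out split only, so these scores come from models trained on far less data.
    print 'Cross_val_score: ', cross_val_score(cv, X_test, y_test)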
    
    y_predictions = cv.predict(X_test)
    mse = mean_squared_error(y_test, y_predictions)
    print 'Explained variance score: ', explained_variance_score(y_test, y_predictions)
    print 'Mean absolute error: ', mean_absolute_error(y_test, y_predictions)
    print 'Mean squared error: ', mse
    print 'R2 score: ', r2_score(y_test, y_predictions)
    
    display(pd.DataFrame.from_dict(cv.cv_results_))
    
    # capture the best pipeline estimator and mse value
    y_pipelines_lin.append(cv.best_estimator_)
    y_scores_lin.append(mse)
    
print '\nCompleted in %0.2f sec.' % (time.time()-start)


---------- Ca
Cross_val_score:  [ 0.9018189   0.89352108  0.84671989]
Explained variance score:  0.876967925491
Mean absolute error:  0.17015727213
Mean squared error:  0.0996165487777
R2 score:  0.875869499556
mean_fit_time mean_score_time mean_test_score mean_train_score param_pca__n_components params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 1.086314 0.025823 0.830764 0.849368 20 {u'pca__n_components': 20} 7 0.817902 0.863425 0.837805 0.838261 0.836584 0.846418 0.715298 0.000262 0.009108 0.010482
1 0.620967 0.027342 0.861632 0.895200 30 {u'pca__n_components': 30} 6 0.856921 0.908852 0.876581 0.872975 0.851395 0.903772 0.014257 0.003937 0.010808 0.015851
2 0.782542 0.030695 0.882080 0.921041 40 {u'pca__n_components': 40} 1 0.874330 0.930445 0.908972 0.910123 0.862939 0.922555 0.040136 0.000726 0.019576 0.008365
3 0.894140 0.055397 0.881513 0.928340 50 {u'pca__n_components': 50} 2 0.871773 0.936191 0.903384 0.921105 0.869381 0.927724 0.083426 0.023754 0.015496 0.006174
4 0.765452 0.039704 0.870816 0.933811 60 {u'pca__n_components': 60} 3 0.867030 0.939589 0.892774 0.928650 0.852644 0.933194 0.105066 0.004127 0.016600 0.004487
5 0.746985 0.041109 0.870122 0.943801 70 {u'pca__n_components': 70} 4 0.863644 0.948598 0.889492 0.945034 0.857231 0.937772 0.034318 0.005162 0.013945 0.004505
6 0.840365 0.041170 0.864378 0.948144 80 {u'pca__n_components': 80} 5 0.871274 0.953136 0.889601 0.948387 0.832261 0.942908 0.029675 0.002179 0.023911 0.004179
---------- P
Cross_val_score:  [  2.98950704e-03  -4.84013854e+00  -1.99501612e-01]
Explained variance score:  0.138930956868
Mean absolute error:  0.453004842422
Mean squared error:  1.47693037109
R2 score:  0.137680917548
mean_fit_time mean_score_time mean_test_score mean_train_score param_pca__n_components params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.650554 0.029594 0.047381 0.116608 20 {u'pca__n_components': 20} 3 0.032295 0.120625 0.102160 0.093096 0.007686 0.136102 0.042979 0.002762 0.040017 0.017785
1 0.662778 0.027960 0.038887 0.142683 30 {u'pca__n_components': 30} 5 -0.016329 0.155570 0.112529 0.115054 0.020460 0.157425 0.060908 0.003387 0.054196 0.019551
2 0.888873 0.036900 0.052904 0.181353 40 {u'pca__n_components': 40} 2 0.002227 0.187685 0.148511 0.149511 0.007974 0.206863 0.108081 0.014025 0.067645 0.023838
3 0.823283 0.034193 0.042726 0.234671 50 {u'pca__n_components': 50} 4 -0.071456 0.259544 0.177242 0.190185 0.022392 0.254282 0.049633 0.000861 0.102544 0.031529
4 0.731970 0.045471 0.072215 0.273591 60 {u'pca__n_components': 60} 1 -0.012869 0.298118 0.205240 0.220625 0.024273 0.302031 0.027742 0.012251 0.095277 0.037487
5 0.748640 0.042000 0.036456 0.310599 70 {u'pca__n_components': 70} 6 -0.089884 0.337609 0.236107 0.248127 -0.036854 0.346060 0.051708 0.002328 0.142825 0.044309
6 0.822885 0.041014 0.023835 0.356060 80 {u'pca__n_components': 80} 7 -0.221002 0.418672 0.264379 0.264147 0.028129 0.385361 0.016508 0.008210 0.198179 0.066400
---------- pH
Cross_val_score:  [ 0.775338    0.75690956  0.70801707]
Explained variance score:  0.831625973523
Mean absolute error:  0.300857557799
Mean squared error:  0.148443906652
R2 score:  0.830201059591
mean_fit_time mean_score_time mean_test_score mean_train_score param_pca__n_components params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.760764 0.040453 0.700911 0.719188 20 {u'pca__n_components': 20} 7 0.688433 0.728797 0.710065 0.711548 0.704234 0.717220 0.079586 0.012664 0.009139 0.007178
1 0.722684 0.045401 0.761708 0.786851 30 {u'pca__n_components': 30} 6 0.734473 0.800396 0.758960 0.789438 0.791692 0.770718 0.077800 0.011477 0.023440 0.012254
2 1.016979 0.032707 0.767811 0.805886 40 {u'pca__n_components': 40} 5 0.754227 0.813934 0.773860 0.807089 0.775345 0.796636 0.113929 0.000899 0.009624 0.007113
3 0.845340 0.047574 0.775525 0.819959 50 {u'pca__n_components': 50} 4 0.748196 0.833496 0.787548 0.816268 0.790832 0.810112 0.063723 0.007952 0.019371 0.009897
4 0.720004 0.036166 0.788143 0.834645 60 {u'pca__n_components': 60} 2 0.767813 0.846194 0.790665 0.837067 0.805951 0.820674 0.022982 0.003365 0.015672 0.010558
5 0.710926 0.040429 0.785077 0.844342 70 {u'pca__n_components': 70} 3 0.774255 0.852076 0.773732 0.844686 0.807244 0.836264 0.008208 0.001303 0.015676 0.006460
6 0.789773 0.042149 0.794414 0.855475 80 {u'pca__n_components': 80} 1 0.783200 0.860301 0.795027 0.857435 0.805015 0.848688 0.002127 0.004800 0.008916 0.004939
---------- SOC
Cross_val_score:  [ 0.78563623  0.80437933  0.84069949]
Explained variance score:  0.878020204418
Mean absolute error:  0.233259012387
Mean squared error:  0.168214645396
R2 score:  0.877946841399
mean_fit_time mean_score_time mean_test_score mean_train_score param_pca__n_components params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.749873 0.039931 0.808688 0.844451 20 {u'pca__n_components': 20} 7 0.813357 0.846973 0.782103 0.842297 0.830603 0.844084 0.138310 0.017094 0.020073 0.001927
1 0.715325 0.043402 0.835912 0.862136 30 {u'pca__n_components': 30} 6 0.825580 0.866978 0.843852 0.854666 0.838303 0.864763 0.111403 0.008017 0.007649 0.005359
2 0.979997 0.082770 0.838050 0.872848 40 {u'pca__n_components': 40} 5 0.827372 0.875439 0.862314 0.860152 0.824462 0.882954 0.067699 0.055805 0.017199 0.009488
3 0.872713 0.047885 0.861558 0.901625 50 {u'pca__n_components': 50} 4 0.857239 0.903024 0.876040 0.901725 0.851395 0.900125 0.042844 0.007057 0.010514 0.001186
4 0.758939 0.040918 0.876587 0.920156 60 {u'pca__n_components': 60} 3 0.881751 0.924166 0.881737 0.911470 0.866274 0.924832 0.093803 0.000766 0.007293 0.006148
5 0.969667 0.075004 0.888458 0.932481 70 {u'pca__n_components': 70} 2 0.883041 0.936020 0.898859 0.920715 0.883474 0.940710 0.220412 0.057450 0.007357 0.008538
6 0.853804 0.045735 0.890480 0.940633 80 {u'pca__n_components': 80} 1 0.884931 0.945019 0.892262 0.929626 0.894246 0.947252 0.036937 0.003606 0.004006 0.007836
---------- Sand
Cross_val_score:  [ 0.78615055  0.79223851  0.74078617]
Explained variance score:  0.876747476308
Mean absolute error:  0.251868652818
Mean squared error:  0.129080937319
R2 score:  0.876313748199
mean_fit_time mean_score_time mean_test_score mean_train_score param_pca__n_components params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.700024 0.030221 0.739525 0.754093 20 {u'pca__n_components': 20} 7 0.727276 0.759576 0.755578 0.755534 0.735720 0.747170 0.105593 0.005513 0.011863 0.005166
1 0.621172 0.036462 0.804791 0.821229 30 {u'pca__n_components': 30} 6 0.810018 0.817824 0.819448 0.816885 0.784907 0.828979 0.034206 0.002352 0.014578 0.005493
2 0.900693 0.036882 0.822155 0.844711 40 {u'pca__n_components': 40} 5 0.814481 0.845421 0.846389 0.836246 0.805594 0.852467 0.056999 0.002138 0.017516 0.006641
3 0.907363 0.033232 0.837170 0.871940 50 {u'pca__n_components': 50} 4 0.828071 0.876881 0.857786 0.864384 0.825653 0.874554 0.083856 0.005079 0.014611 0.005426
4 0.876933 0.048775 0.846526 0.885516 60 {u'pca__n_components': 60} 3 0.826955 0.890923 0.866518 0.881973 0.846107 0.883652 0.124213 0.014373 0.016154 0.003884
5 0.844855 0.044044 0.848578 0.893577 70 {u'pca__n_components': 70} 2 0.832872 0.900571 0.859049 0.890222 0.853814 0.889939 0.133375 0.005507 0.011310 0.004946
6 0.898456 0.063582 0.852341 0.902965 80 {u'pca__n_components': 80} 1 0.834918 0.913065 0.866175 0.897098 0.855931 0.898732 0.059424 0.026996 0.013011 0.007173
Completed in 203.37 sec.
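
The negative scores reported for P are easier to interpret with the definition of R2 in hand: R2 = 1 - SS_res / SS_tot, so a model that predicts worse than the constant mean of the targets scores below zero. A plain-Python sketch with made-up numbers (not the notebook's data):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [0.0, 1.0, 2.0, 3.0]

perfect = r2(y_true, [0.0, 1.0, 2.0, 3.0])      # predictions equal the targets
mean_only = r2(y_true, [1.5, 1.5, 1.5, 1.5])    # always predict the mean
reversed_ = r2(y_true, [3.0, 2.0, 1.0, 0.0])    # worse than predicting the mean

print(perfect)
print(mean_only)
print(reversed_)
```

The three cases give 1.0, 0.0, and a negative value respectively, which is why an R2 near zero or below it for P signals that the pipeline captures little of that target's variation.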

In [10]:
print len(y_pipelines_lin)
print y_scores_lin


5
[0.09961654877768189, 1.4769303710945778, 0.14844390665164739, 0.16821464539593423, 0.12908093731872916]
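
The bare list of scores is easier to read when paired with the target names. A small sketch that reuses the values printed above and also reports the RMSE; because each target gets its own random split and has its own scale, the comparison across targets is only indicative.

```python
import math

y_var_labels = ['Ca', 'P', 'pH', 'SOC', 'Sand']
y_scores_lin = [0.09961654877768189, 1.4769303710945778, 0.14844390665164739,
                0.16821464539593423, 0.12908093731872916]

# Pair each target with its held-out MSE and sort from lowest to highest error
ranked = sorted(zip(y_var_labels, y_scores_lin), key=lambda pair: pair[1])
for label, mse in ranked:
    print('%-5s MSE = %.4f  RMSE = %.4f' % (label, mse, math.sqrt(mse)))
```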

In [11]:
# Linear Regression with PCA and SelectKBest

y_pipelines_linsel = []
y_scores_linsel = []

start = time.time()
for ind, y in enumerate(y_vars):
    
    print '\n----------', y_var_labels[ind]

    # set up the train and test data
    X_train, X_test, y_train, y_test = train_test_split(X, y)

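    # Initial n_components and k are placeholders; the grid search below overrides them.
    pca = PCA(n_components=2)
    # NOTE: SelectKBest defaults to score_func=f_classif, an ANOVA test meant for
    # class labels; for continuous targets such as these, f_regression is the usual choice.
    selection = SelectKBest(k=1)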
    combined_features = FeatureUnion([('pca', pca), ('univ_select', selection)])
    linear = LinearRegression()
    
    steps = [('features', combined_features), ('linear', linear)]
    pipeline = Pipeline(steps)

    parameters = dict(features__pca__n_components=list(range(20, 90, 10)),
                  features__univ_select__k=[1, 2, 3])

    cv = GridSearchCV(pipeline, param_grid=parameters, verbose=0)
    cv.fit(X_train, y_train)   

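    # Note: this clones the grid search and re-runs it with cross-validation on the
    # held-out split only, so these scores come from models trained on far less data.
    print 'Cross_val_score: ', cross_val_score(cv, X_test, y_test)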
    
    y_predictions = cv.predict(X_test)
    mse = mean_squared_error(y_test, y_predictions)
    print 'Explained variance score: ', explained_variance_score(y_test, y_predictions)
    print 'Mean absolute error: ', mean_absolute_error(y_test, y_predictions)
    print 'Mean squared error: ', mse
    print 'R2 score: ', r2_score(y_test, y_predictions)
    
    display(pd.DataFrame.from_dict(cv.cv_results_))
    
    # capture the best pipeline estimator and mse value
    y_pipelines_linsel.append(cv.best_estimator_)
    y_scores_linsel.append(mse)
    
print '\nCompleted in %0.2f sec.' % (time.time()-start)
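
The parameter grid in this cell expands to 7 x 3 = 21 candidate settings per target, one per row of each results table. The sketch below enumerates them with `itertools.product`; it only mirrors the row order shown in the results and does not rely on how `GridSearchCV` orders candidates internally. The variable names are illustrative.

```python
from itertools import product

n_components_grid = list(range(20, 90, 10))   # 20, 30, ..., 80
k_grid = [1, 2, 3]

# Nested names follow the step__substep__parameter convention used in the cell:
# 'features' is the FeatureUnion step, 'pca' and 'univ_select' are its members.
candidates = [
    {'features__pca__n_components': n, 'features__univ_select__k': k}
    for n, k in product(n_components_grid, k_grid)
]

print(len(candidates))
print(candidates[0])
print(candidates[-1])
```

Because `FeatureUnion` concatenates the outputs of its transformers, each candidate hands the linear model n + k input columns (for example, 20 principal components plus 1 selected raw feature).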


---------- Ca
Cross_val_score:  [ 0.90374986  0.94616482  0.93589925]
Explained variance score:  0.865173915803
Mean absolute error:  0.202007953992
Mean squared error:  0.267685923343
R2 score:  0.863010088778
mean_fit_time mean_score_time mean_test_score mean_train_score param_features__pca__n_components param_features__univ_select__k params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 1.079391 0.052765 0.822891 0.857318 20 1 {u'features__pca__n_components': 20, u'feature... 20 0.808843 0.845454 0.847154 0.860913 0.812677 0.865587 0.082361 0.031317 0.017228 0.008603
1 0.928950 0.028119 0.820936 0.857709 20 2 {u'features__pca__n_components': 20, u'feature... 21 0.803320 0.846183 0.846422 0.861351 0.813066 0.865594 0.145668 0.004451 0.018455 0.008332
2 0.850011 0.031605 0.824634 0.859714 20 3 {u'features__pca__n_components': 20, u'feature... 19 0.814378 0.851624 0.847368 0.861682 0.812157 0.865835 0.052487 0.000881 0.016101 0.005966
3 0.836940 0.067183 0.847035 0.885473 30 1 {u'features__pca__n_components': 30, u'feature... 18 0.830067 0.876533 0.866153 0.889061 0.844887 0.890825 0.021752 0.022987 0.014810 0.006362
4 0.873259 0.034247 0.847552 0.886349 30 2 {u'features__pca__n_components': 30, u'feature... 17 0.828181 0.876636 0.868172 0.889908 0.846304 0.892504 0.082614 0.002320 0.016350 0.006950
5 0.866009 0.037429 0.856353 0.889720 30 3 {u'features__pca__n_components': 30, u'feature... 16 0.835322 0.878314 0.868555 0.889957 0.865181 0.900887 0.032872 0.003128 0.014935 0.009217
6 0.998366 0.042332 0.876610 0.905984 40 1 {u'features__pca__n_components': 40, u'feature... 7 0.879630 0.902132 0.879311 0.906407 0.870889 0.909412 0.045487 0.003007 0.004047 0.002987
7 0.963970 0.044331 0.875042 0.907181 40 2 {u'features__pca__n_components': 40, u'feature... 9 0.875981 0.902961 0.877871 0.906851 0.871273 0.911730 0.026155 0.012439 0.002774 0.003587
8 1.020491 0.040956 0.876025 0.907419 40 3 {u'features__pca__n_components': 40, u'feature... 8 0.877778 0.903450 0.877824 0.906867 0.872472 0.911940 0.051416 0.002136 0.002512 0.003488
9 1.029200 0.045445 0.876982 0.913082 50 1 {u'features__pca__n_components': 50, u'feature... 4 0.876574 0.910058 0.877372 0.911241 0.876999 0.917948 0.003789 0.001764 0.000326 0.003474
10 1.030992 0.040242 0.876879 0.913499 50 2 {u'features__pca__n_components': 50, u'feature... 6 0.875379 0.910379 0.876399 0.911585 0.878860 0.918534 0.008096 0.002154 0.001461 0.003594
11 1.075547 0.038200 0.876955 0.913564 50 3 {u'features__pca__n_components': 50, u'feature... 5 0.875880 0.910447 0.876199 0.911634 0.878786 0.918611 0.011438 0.002344 0.001301 0.003602
12 0.899459 0.045963 0.874489 0.920564 60 1 {u'features__pca__n_components': 60, u'feature... 10 0.875305 0.917646 0.876340 0.916938 0.871821 0.927109 0.050380 0.001090 0.001933 0.004637
13 0.863758 0.046731 0.873473 0.921176 60 2 {u'features__pca__n_components': 60, u'feature... 11 0.875190 0.918017 0.875739 0.917020 0.869489 0.928491 0.013727 0.002289 0.002826 0.005188
14 0.886532 0.042884 0.873370 0.921219 60 3 {u'features__pca__n_components': 60, u'feature... 12 0.874646 0.918148 0.875284 0.917200 0.870180 0.928310 0.034260 0.001320 0.002271 0.005029
15 0.912289 0.047019 0.880088 0.926909 70 1 {u'features__pca__n_components': 70, u'feature... 2 0.886777 0.923210 0.870760 0.926361 0.882727 0.931157 0.039233 0.002751 0.006800 0.003267
16 0.898165 0.043206 0.879901 0.927311 70 2 {u'features__pca__n_components': 70, u'feature... 3 0.886499 0.923638 0.871658 0.926307 0.881545 0.931988 0.034907 0.002653 0.006169 0.003482
17 0.899727 0.046057 0.880286 0.927651 70 3 {u'features__pca__n_components': 70, u'feature... 1 0.886804 0.923773 0.871175 0.926891 0.882879 0.932287 0.042226 0.002107 0.006638 0.003517
18 1.028562 0.045476 0.872247 0.930915 80 1 {u'features__pca__n_components': 80, u'feature... 14 0.873325 0.926832 0.856579 0.932679 0.886835 0.933234 0.034828 0.001571 0.012375 0.002896
19 1.089867 0.051428 0.871997 0.931388 80 2 {u'features__pca__n_components': 80, u'feature... 15 0.873136 0.926950 0.856605 0.932685 0.886250 0.934528 0.125310 0.001927 0.012129 0.003227
20 1.107853 0.054798 0.872933 0.931726 80 3 {u'features__pca__n_components': 80, u'feature... 13 0.873491 0.926986 0.858255 0.933238 0.887054 0.934953 0.040910 0.006066 0.011763 0.003424
---------- P
Cross_val_score:  [-0.0786123  -0.19400468 -0.21609633]
Explained variance score:  -0.207038283646
Mean absolute error:  0.399216664639
Mean squared error:  0.462757982131
R2 score:  -0.224795875127
mean_fit_time mean_score_time mean_test_score mean_train_score param_features__pca__n_components param_features__univ_select__k params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 1.081165 0.040394 -0.026545 0.139378 20 1 {u'features__pca__n_components': 20, u'feature... 1 -0.142351 0.165917 0.024253 0.124440 0.038463 0.127778 0.192622 0.014015 0.082092 0.018815
1 0.923877 0.053861 -0.028005 0.140065 20 2 {u'features__pca__n_components': 20, u'feature... 2 -0.145169 0.166827 0.022308 0.124846 0.038845 0.128520 0.089977 0.027099 0.083122 0.018983
2 0.978557 0.045699 -0.028562 0.141513 20 3 {u'features__pca__n_components': 20, u'feature... 3 -0.146024 0.168131 0.016953 0.125352 0.043384 0.131055 0.160302 0.009646 0.083756 0.018965
3 0.892768 0.046006 -0.055104 0.174435 30 1 {u'features__pca__n_components': 30, u'feature... 4 -0.167592 0.205667 -0.062961 0.176593 0.065243 0.141047 0.056857 0.009500 0.095217 0.026425
4 0.833294 0.045824 -0.063798 0.185436 30 2 {u'features__pca__n_components': 30, u'feature... 5 -0.184293 0.218022 -0.081218 0.185071 0.074117 0.153215 0.027718 0.008232 0.106212 0.026459
5 0.947285 0.060993 -0.077946 0.188714 30 3 {u'features__pca__n_components': 30, u'feature... 7 -0.187259 0.218484 -0.122917 0.194152 0.076338 0.153508 0.111869 0.027372 0.112213 0.026804
6 1.035036 0.052901 -0.092917 0.229946 40 1 {u'features__pca__n_components': 40, u'feature... 8 -0.194073 0.263440 -0.153430 0.227626 0.068752 0.198772 0.057000 0.005898 0.115515 0.026452
7 1.055359 0.042741 -0.136081 0.236091 40 2 {u'features__pca__n_components': 40, u'feature... 10 -0.224384 0.272491 -0.253223 0.236183 0.069364 0.199598 0.079217 0.003988 0.145748 0.029759
8 1.031260 0.069399 -0.150126 0.240690 40 3 {u'features__pca__n_components': 40, u'feature... 12 -0.235244 0.281030 -0.281390 0.239538 0.066255 0.201503 0.033281 0.031867 0.154160 0.032477
9 1.401635 0.051331 -0.074266 0.270746 50 1 {u'features__pca__n_components': 50, u'feature... 6 -0.232171 0.301679 -0.068769 0.262166 0.078142 0.248393 0.289035 0.011113 0.126744 0.022584
10 1.267741 0.055768 -0.118227 0.273752 50 2 {u'features__pca__n_components': 50, u'feature... 9 -0.259600 0.304058 -0.170557 0.268046 0.075475 0.249151 0.135833 0.008019 0.141710 0.022776
11 1.172220 0.045004 -0.140754 0.277701 50 3 {u'features__pca__n_components': 50, u'feature... 11 -0.308023 0.314301 -0.188299 0.268923 0.074059 0.249879 0.031784 0.002244 0.159566 0.027023
12 1.058645 0.057014 -0.155833 0.332741 60 1 {u'features__pca__n_components': 60, u'feature... 14 -0.343850 0.362131 -0.231486 0.315916 0.107836 0.320176 0.140464 0.008678 0.192003 0.020855
13 1.089270 0.056993 -0.155580 0.333209 60 2 {u'features__pca__n_components': 60, u'feature... 13 -0.343648 0.362811 -0.226067 0.316430 0.102975 0.320386 0.054878 0.008112 0.189023 0.020994
14 1.493007 0.055981 -0.165539 0.342890 60 3 {u'features__pca__n_components': 60, u'feature... 15 -0.389607 0.390110 -0.210278 0.316402 0.103269 0.322157 0.647129 0.011166 0.203688 0.033472
15 1.307175 0.061642 -0.180015 0.385702 70 1 {u'features__pca__n_components': 70, u'feature... 19 -0.396567 0.432867 -0.251416 0.366843 0.107936 0.357395 0.142735 0.012192 0.212060 0.033573
16 1.020342 0.063737 -0.198132 0.388474 70 2 {u'features__pca__n_components': 70, u'feature... 20 -0.397569 0.432795 -0.296383 0.371773 0.099555 0.360853 0.104243 0.001796 0.214512 0.031656
17 1.096178 0.059621 -0.219425 0.390673 70 3 {u'features__pca__n_components': 70, u'feature... 21 -0.406198 0.435102 -0.342924 0.374784 0.090847 0.362134 0.028407 0.005863 0.220911 0.031837
18 1.252645 0.051502 -0.175929 0.429770 80 1 {u'features__pca__n_components': 80, u'feature... 17 -0.368348 0.440276 -0.299451 0.402842 0.140011 0.446192 0.149162 0.007742 0.225167 0.019193
19 1.344964 0.051955 -0.178312 0.430575 80 2 {u'features__pca__n_components': 80, u'feature... 18 -0.367480 0.441558 -0.300132 0.402435 0.132675 0.447730 0.059225 0.005658 0.221613 0.020057
20 1.180695 0.059849 -0.174038 0.434133 80 3 {u'features__pca__n_components': 80, u'feature... 16 -0.342695 0.446771 -0.305069 0.406055 0.125650 0.449574 0.037165 0.007125 0.212467 0.019887
---------- pH
Cross_val_score:  [ 0.71784612  0.49996878  0.58810445]
Explained variance score:  0.726793952716
Mean absolute error:  0.305946936983
Mean squared error:  0.210756223603
R2 score:  0.725082293999
mean_fit_time mean_score_time mean_test_score mean_train_score param_features__pca__n_components param_features__univ_select__k params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.992496 0.033371 0.725147 0.749074 20 1 {u'features__pca__n_components': 20, u'feature... 21 0.689232 0.762567 0.749796 0.738003 0.736413 0.746650 0.142253 0.002302 0.025977 0.010174
1 0.865931 0.041709 0.729087 0.752808 20 2 {u'features__pca__n_components': 20, u'feature... 20 0.692125 0.762925 0.760305 0.748206 0.734832 0.747291 0.067894 0.010047 0.028129 0.007164
2 0.911681 0.040016 0.730316 0.760688 20 3 {u'features__pca__n_components': 20, u'feature... 19 0.693617 0.766966 0.758986 0.752977 0.738343 0.762122 0.064165 0.011113 0.027284 0.005800
3 0.915488 0.041886 0.778122 0.809184 30 1 {u'features__pca__n_components': 30, u'feature... 18 0.763391 0.817309 0.785207 0.806574 0.785769 0.803670 0.033230 0.002713 0.010419 0.005866
4 1.104462 0.073896 0.779922 0.811627 30 2 {u'features__pca__n_components': 30, u'feature... 17 0.764090 0.817342 0.787790 0.812590 0.787887 0.804950 0.178665 0.040433 0.011195 0.005105
5 1.098536 0.052381 0.781757 0.814944 30 3 {u'features__pca__n_components': 30, u'feature... 16 0.764478 0.820065 0.787048 0.815079 0.793745 0.809686 0.152631 0.009815 0.012520 0.004238
6 1.213015 0.038476 0.791412 0.824095 40 1 {u'features__pca__n_components': 40, u'feature... 15 0.774093 0.830769 0.801612 0.821774 0.798531 0.819742 0.048849 0.004802 0.012311 0.004791
7 1.145088 0.053016 0.793347 0.825529 40 2 {u'features__pca__n_components': 40, u'feature... 13 0.776066 0.830957 0.804015 0.825421 0.799960 0.820210 0.097420 0.005838 0.012331 0.004388
8 1.231815 0.047226 0.792171 0.827008 40 3 {u'features__pca__n_components': 40, u'feature... 14 0.775107 0.831663 0.803376 0.827712 0.798029 0.821650 0.050253 0.006225 0.012262 0.004118
9 1.169220 0.052938 0.798990 0.840436 50 1 {u'features__pca__n_components': 50, u'feature... 12 0.780284 0.840623 0.800591 0.842389 0.816095 0.838296 0.080476 0.012546 0.014664 0.001676
10 1.099298 0.044664 0.799491 0.841217 50 2 {u'features__pca__n_components': 50, u'feature... 11 0.780657 0.840640 0.801731 0.844734 0.816084 0.838276 0.017244 0.001717 0.014550 0.002668
11 1.244722 0.049428 0.802460 0.842472 50 3 {u'features__pca__n_components': 50, u'feature... 10 0.784913 0.841744 0.801789 0.845512 0.820678 0.840161 0.047040 0.007609 0.014609 0.002245
12 1.142129 0.051960 0.817109 0.855183 60 1 {u'features__pca__n_components': 60, u'feature... 8 0.809898 0.854756 0.810046 0.857023 0.831384 0.853771 0.154142 0.001554 0.010094 0.001362
13 1.081715 0.047175 0.817371 0.855428 60 2 {u'features__pca__n_components': 60, u'feature... 6 0.810772 0.854786 0.809747 0.857620 0.831592 0.853879 0.043891 0.005883 0.010065 0.001593
14 1.031306 0.048147 0.818297 0.855831 60 3 {u'features__pca__n_components': 60, u'feature... 4 0.811577 0.854887 0.811458 0.858778 0.831858 0.853829 0.101945 0.003807 0.009589 0.002128
15 1.105018 0.060102 0.817295 0.862471 70 1 {u'features__pca__n_components': 70, u'feature... 7 0.826793 0.864130 0.800341 0.864430 0.824752 0.858854 0.077442 0.009677 0.012017 0.002561
16 1.281273 0.051500 0.815957 0.862931 70 2 {u'features__pca__n_components': 70, u'feature... 9 0.825919 0.863639 0.797067 0.866312 0.824883 0.858844 0.164085 0.002231 0.013363 0.003090
17 1.164358 0.060739 0.817621 0.863314 70 3 {u'features__pca__n_components': 70, u'feature... 5 0.827264 0.864044 0.799082 0.866651 0.826517 0.859248 0.147508 0.013093 0.013112 0.003066
18 1.133511 0.056370 0.831326 0.874966 80 1 {u'features__pca__n_components': 80, u'feature... 2 0.833632 0.869338 0.816478 0.882332 0.843868 0.873227 0.017210 0.008906 0.011300 0.005445
19 1.310398 0.060889 0.831108 0.875388 80 2 {u'features__pca__n_components': 80, u'feature... 3 0.833061 0.869900 0.816118 0.882789 0.844147 0.873474 0.073944 0.006236 0.011526 0.005433
20 1.380372 0.056380 0.831770 0.875407 80 3 {u'features__pca__n_components': 80, u'feature... 1 0.834176 0.869754 0.817118 0.883051 0.844017 0.873415 0.165435 0.007861 0.011113 0.005608
---------- SOC
Cross_val_score:  [ 0.78031727  0.86680632  0.66199438]
Explained variance score:  0.88750452386
Mean absolute error:  0.245361854875
Mean squared error:  0.169464148411
R2 score:  0.887421323921
mean_fit_time mean_score_time mean_test_score mean_train_score param_features__pca__n_components param_features__univ_select__k params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.681192 0.035337 0.831118 0.847271 20 1 {u'features__pca__n_components': 20, u'feature... 21 0.837019 0.848215 0.841431 0.846416 0.814905 0.847182 0.196038 0.007060 0.011605 0.000737
1 0.920377 0.046202 0.838020 0.850516 20 2 {u'features__pca__n_components': 20, u'feature... 20 0.833762 0.853885 0.841310 0.846416 0.838989 0.851247 0.226873 0.009279 0.003156 0.003093
2 0.972239 0.033836 0.846720 0.854123 20 3 {u'features__pca__n_components': 20, u'feature... 18 0.835124 0.854699 0.860047 0.855094 0.844991 0.852575 0.321887 0.002034 0.010248 0.001106
3 0.958769 0.037154 0.842644 0.870048 30 1 {u'features__pca__n_components': 30, u'feature... 19 0.838136 0.874138 0.852772 0.867228 0.837026 0.868777 0.263317 0.011146 0.007175 0.002960
4 0.879586 0.049053 0.852473 0.873002 30 2 {u'features__pca__n_components': 30, u'feature... 16 0.839087 0.875450 0.861040 0.871233 0.857292 0.872322 0.231591 0.020132 0.009588 0.001787
5 0.652824 0.031191 0.854883 0.876695 30 3 {u'features__pca__n_components': 30, u'feature... 15 0.835836 0.882247 0.872538 0.875452 0.856274 0.872384 0.090116 0.005662 0.015016 0.004121
6 1.073847 0.046463 0.849617 0.880270 40 1 {u'features__pca__n_components': 40, u'feature... 17 0.840576 0.887842 0.855301 0.874839 0.852974 0.878129 0.279024 0.009716 0.006463 0.005521
7 0.755436 0.029027 0.856202 0.882690 40 2 {u'features__pca__n_components': 40, u'feature... 14 0.841762 0.888404 0.862255 0.880377 0.864588 0.879290 0.082151 0.000529 0.010255 0.004065
8 0.642800 0.030344 0.857144 0.884617 40 3 {u'features__pca__n_components': 40, u'feature... 13 0.838498 0.892341 0.868659 0.882216 0.864274 0.879295 0.023379 0.001403 0.013305 0.005590
9 0.813319 0.030596 0.864310 0.905646 50 1 {u'features__pca__n_components': 50, u'feature... 10 0.853666 0.920060 0.894942 0.892635 0.844322 0.904243 0.153815 0.000967 0.021993 0.011240
10 1.146746 0.042270 0.862659 0.906158 50 2 {u'features__pca__n_components': 50, u'feature... 12 0.855228 0.920705 0.891631 0.893504 0.841118 0.904265 0.153162 0.009870 0.021281 0.011185
11 1.270008 0.041130 0.863166 0.907123 50 3 {u'features__pca__n_components': 50, u'feature... 11 0.858097 0.921827 0.894196 0.894111 0.837204 0.905431 0.278981 0.007386 0.023542 0.011378
12 0.927872 0.034728 0.877852 0.921295 60 1 {u'features__pca__n_components': 60, u'feature... 7 0.867096 0.933388 0.904688 0.912402 0.861771 0.918096 0.224900 0.006435 0.019100 0.008861
13 0.743191 0.038732 0.876308 0.921907 60 2 {u'features__pca__n_components': 60, u'feature... 8 0.867181 0.934092 0.900236 0.913458 0.861506 0.918171 0.063640 0.003937 0.017078 0.008828
14 0.883135 0.042962 0.874126 0.922976 60 3 {u'features__pca__n_components': 60, u'feature... 9 0.867333 0.934435 0.903555 0.914610 0.851491 0.919884 0.141811 0.007816 0.021791 0.008384
15 0.966883 0.059085 0.889129 0.931740 70 1 {u'features__pca__n_components': 70, u'feature... 1 0.879558 0.941748 0.906662 0.927809 0.881168 0.925665 0.085121 0.017348 0.012415 0.007130
16 0.924690 0.035769 0.888482 0.931689 70 2 {u'features__pca__n_components': 70, u'feature... 2 0.878915 0.941584 0.906070 0.927813 0.880460 0.925671 0.147400 0.007986 0.012453 0.007051
17 1.197027 0.057927 0.885093 0.932364 70 3 {u'features__pca__n_components': 70, u'feature... 3 0.877150 0.941989 0.906860 0.928326 0.871269 0.926778 0.227215 0.016178 0.015578 0.006835
18 1.209791 0.054755 0.882019 0.937971 80 1 {u'features__pca__n_components': 80, u'feature... 5 0.879430 0.946098 0.908489 0.933400 0.858137 0.934414 0.113977 0.004346 0.020638 0.005762
19 0.994617 0.050183 0.882814 0.938203 80 2 {u'features__pca__n_components': 80, u'feature... 4 0.879762 0.946003 0.906888 0.933548 0.861791 0.935056 0.247823 0.011264 0.018537 0.005550
20 1.076414 0.045228 0.878012 0.938748 80 3 {u'features__pca__n_components': 80, u'feature... 6 0.877560 0.945987 0.905223 0.934264 0.851253 0.935995 0.065927 0.002806 0.022036 0.005167
---------- Sand
Cross_val_score:  [ 0.82821617  0.86618492  0.8445729 ]
Explained variance score:  0.889904449533
Mean absolute error:  0.248133611479
Mean squared error:  0.114786892406
R2 score:  0.889904422625
mean_fit_time mean_score_time mean_test_score mean_train_score param_features__pca__n_components param_features__univ_select__k params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.561545 0.024566 0.745354 0.764740 20 1 {u'features__pca__n_components': 20, u'feature... 21 0.765663 0.752979 0.765967 0.756153 0.704431 0.785088 0.043940 0.002628 0.028937 0.014447
1 0.791148 0.030129 0.745899 0.765695 20 2 {u'features__pca__n_components': 20, u'feature... 20 0.765696 0.753613 0.768838 0.758232 0.703164 0.785242 0.047267 0.004130 0.030246 0.013949
2 1.015889 0.044647 0.780635 0.796424 20 3 {u'features__pca__n_components': 20, u'feature... 19 0.819863 0.791244 0.818702 0.812514 0.703339 0.785513 0.127861 0.005863 0.054658 0.011616
3 0.743588 0.031161 0.793264 0.820243 30 1 {u'features__pca__n_components': 30, u'feature... 18 0.801728 0.819534 0.819046 0.808177 0.759020 0.833018 0.163085 0.009037 0.025226 0.010154
4 1.000340 0.035295 0.793597 0.820828 30 2 {u'features__pca__n_components': 30, u'feature... 17 0.802108 0.819614 0.820222 0.809728 0.758462 0.833142 0.163225 0.000691 0.025922 0.009597
5 1.054117 0.044869 0.812975 0.835391 30 3 {u'features__pca__n_components': 30, u'feature... 16 0.838397 0.833710 0.843497 0.839109 0.757032 0.833353 0.052726 0.013963 0.039613 0.002633
6 1.290851 0.046338 0.820953 0.847370 40 1 {u'features__pca__n_components': 40, u'feature... 15 0.834307 0.837526 0.841528 0.844604 0.787024 0.859982 0.082567 0.007841 0.024172 0.009374
7 1.342618 0.086918 0.822936 0.849022 40 2 {u'features__pca__n_components': 40, u'feature... 14 0.838136 0.839738 0.842859 0.847304 0.787813 0.860025 0.042285 0.036925 0.024911 0.008371
8 1.123704 0.042534 0.832006 0.855029 40 3 {u'features__pca__n_components': 40, u'feature... 13 0.851682 0.845261 0.855297 0.859586 0.789039 0.860241 0.066405 0.003886 0.030418 0.006912
9 1.072929 0.047421 0.841179 0.870672 50 1 {u'features__pca__n_components': 50, u'feature... 11 0.861228 0.861574 0.856392 0.869025 0.805917 0.881416 0.073628 0.008217 0.025012 0.008184
10 1.398958 0.045370 0.840036 0.871263 50 2 {u'features__pca__n_components': 50, u'feature... 12 0.858171 0.862400 0.856203 0.869956 0.805735 0.881434 0.084613 0.007820 0.024268 0.007825
11 1.189652 0.051122 0.846057 0.874803 50 3 {u'features__pca__n_components': 50, u'feature... 8 0.866792 0.864263 0.865702 0.878715 0.805677 0.881430 0.026558 0.003594 0.028556 0.007534
12 0.987679 0.061297 0.843475 0.883455 60 1 {u'features__pca__n_components': 60, u'feature... 10 0.873272 0.873400 0.849718 0.886600 0.807434 0.890366 0.059998 0.021631 0.027238 0.007274
13 1.008315 0.045531 0.843954 0.883888 60 2 {u'features__pca__n_components': 60, u'feature... 9 0.870699 0.873774 0.850150 0.886684 0.811012 0.891206 0.031385 0.002159 0.024758 0.007386
14 1.017900 0.045347 0.849303 0.886554 60 3 {u'features__pca__n_components': 60, u'feature... 6 0.876602 0.875591 0.859895 0.892863 0.811412 0.891208 0.037788 0.002632 0.027647 0.007781
15 1.170656 0.051632 0.848044 0.889931 70 1 {u'features__pca__n_components': 70, u'feature... 7 0.872191 0.880320 0.857370 0.891465 0.814569 0.898008 0.295230 0.004833 0.024431 0.007302
16 1.064834 0.046549 0.849466 0.890630 70 2 {u'features__pca__n_components': 70, u'feature... 5 0.872496 0.880335 0.860922 0.893421 0.814982 0.898133 0.188098 0.014237 0.024838 0.007530
17 1.440485 0.054635 0.853632 0.892716 70 3 {u'features__pca__n_components': 70, u'feature... 1 0.876783 0.881357 0.868840 0.898692 0.815274 0.898099 0.240996 0.003045 0.027316 0.008036
18 1.184202 0.134997 0.850907 0.897130 80 1 {u'features__pca__n_components': 80, u'feature... 4 0.867486 0.891621 0.863918 0.897550 0.821316 0.902219 0.308573 0.124889 0.020975 0.004337
19 1.436747 0.044015 0.852208 0.897854 80 2 {u'features__pca__n_components': 80, u'feature... 3 0.869798 0.892019 0.865289 0.899163 0.821537 0.902381 0.407091 0.004297 0.021766 0.004330
20 2.518527 0.055951 0.853198 0.898998 80 3 {u'features__pca__n_components': 80, u'feature... 2 0.869845 0.891799 0.868020 0.902717 0.821729 0.902478 1.491932 0.013486 0.022265 0.005091
Completed in 777.28 sec.
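
The negative R2 and cross-validation scores reported for P above mean the fitted model does worse on held-out data than a constant prediction of the mean. A minimal sketch of the R2 definition on made-up numbers shows how that can happen; the helper `r2` and the toy values are illustrative assumptions, not part of the notebook or the soil data.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy values, not taken from the soil data.
y_true = [0.0, 1.0, 2.0, 3.0]
mean_baseline = [1.5] * 4          # always predict the mean of y_true
poor_model = [3.0, 2.0, 1.0, 0.0]  # predictions that run the wrong way

r2_baseline = r2(y_true, mean_baseline)
r2_poor = r2(y_true, poor_model)
print(r2_baseline)
print(r2_poor)
```

The mean baseline scores exactly zero, so any model whose squared error exceeds the variance of the targets lands below zero.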

In [12]:
# sanity check: one fitted pipeline and one test-set MSE per target variable
print len(y_pipelines_linsel)
print y_scores_linsel


5
[0.26768592334255725, 0.46275798213093117, 0.21075622360331972, 0.16946414841128216, 0.11478689240550471]
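
The bare list is easier to read once each value is paired with its target. A small sketch, assuming the labels follow the order in which the targets are printed in the outputs (Ca, P, pH, SOC, Sand); the variable names here are illustrative and the values are copied from the list above.

```python
import math

# Assumed label order, taken from the order the targets appear in the printed output.
labels = ['Ca', 'P', 'pH', 'SOC', 'Sand']
# Test-set MSE values copied from the printed y_scores_linsel list.
mse_values = [0.26768592334255725, 0.46275798213093117, 0.21075622360331972,
              0.16946414841128216, 0.11478689240550471]

mse_by_target = dict(zip(labels, mse_values))
rmse_by_target = dict((k, math.sqrt(v)) for k, v in mse_by_target.items())

# Targets with the largest and smallest test-set error.
worst = max(mse_by_target, key=mse_by_target.get)
best = min(mse_by_target, key=mse_by_target.get)

for label in labels:
    print('%-5s MSE=%.4f RMSE=%.4f' % (label, mse_by_target[label], rmse_by_target[label]))
print('worst: %s, best: %s' % (worst, best))
```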

In [13]:
# Ridge Regression with PCA combinations

y_pipelines_ridge = []
y_scores_ridge = []

start = time.time()
for ind, y in enumerate(y_vars):

    # set up the train and test data (default 75/25 split, unseeded, so results vary between runs)
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    print '\n----------', y_var_labels[ind]

    pca = PCA()
    ridge = Ridge()
    steps = [('pca', pca), ('ridge', ridge)]
    pipeline = Pipeline(steps)

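    # search over the number of principal components and the ridge penalty;
    # alpha=0.0 is unregularized least squares
    parameters = dict(pca__n_components=list(range(20, 90, 10)),
                      ridge__alpha=np.linspace(0.0, 0.5, 5))

    cv = GridSearchCV(pipeline, param_grid=parameters, verbose=0)
    cv.fit(X_train, y_train)

    # note: this re-runs the whole grid search on each fold of the held-out split,
    # so these scores come from models fit on test data only, not from cv above
    print 'Cross_val_score: ', cross_val_score(cv, X_test, y_test)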
    
    y_predictions = cv.predict(X_test)
    mse = mean_squared_error(y_test, y_predictions)
    print 'Explained variance score: ', explained_variance_score(y_test, y_predictions)
    print 'Mean absolute error: ', mean_absolute_error(y_test, y_predictions)
    print 'Mean squared error: ', mse
    print 'R2 score: ', r2_score(y_test, y_predictions)
    
    display(pd.DataFrame.from_dict(cv.cv_results_))
    
    # capture the best pipeline estimator and mse value
    y_pipelines_ridge.append(cv.best_estimator_)
    y_scores_ridge.append(mse)
    
print '\nCompleted in %0.2f sec.' % (time.time()-start)


---------- Ca
Cross_val_score:  [ 0.94479266  0.92467325  0.94938945]
Explained variance score:  0.907809747623
Mean absolute error:  0.16756810999
Mean squared error:  0.149038100002
R2 score:  0.90672122208
mean_fit_time mean_score_time mean_test_score mean_train_score param_pca__n_components param_ridge__alpha params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.765690 0.040823 0.813747 0.838217 20 0 {u'ridge__alpha': 0.0, u'pca__n_components': 20} 35 0.798094 0.843418 0.796691 0.849875 0.846455 0.821357 0.174340 0.010515 0.023135 0.012209
1 0.699280 0.038453 0.813770 0.838217 20 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 34 0.798232 0.843418 0.796638 0.849875 0.846439 0.821357 0.188765 0.011011 0.023110 0.012209
2 0.772648 0.032436 0.813792 0.838216 20 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 20} 33 0.798368 0.843417 0.796584 0.849874 0.846424 0.821356 0.101629 0.008526 0.023086 0.012209
3 0.629021 0.026633 0.813813 0.838215 20 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 32 0.798503 0.843416 0.796529 0.849873 0.846407 0.821355 0.042242 0.000664 0.023062 0.012210
4 0.593395 0.026582 0.813835 0.838213 20 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 20} 31 0.798638 0.843415 0.796475 0.849872 0.846391 0.821354 0.055693 0.001883 0.023038 0.012210
5 0.667628 0.037941 0.842323 0.880511 30 0 {u'ridge__alpha': 0.0, u'pca__n_components': 30} 27 0.814289 0.899489 0.834021 0.873560 0.878659 0.868484 0.068203 0.007430 0.026927 0.013579
6 0.585042 0.027114 0.842400 0.880508 30 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 26 0.814656 0.899485 0.833830 0.873559 0.878715 0.868481 0.035176 0.006366 0.026845 0.013578
7 0.633975 0.029933 0.842471 0.880500 30 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 30} 25 0.815009 0.899473 0.833639 0.873555 0.878764 0.868470 0.024262 0.002056 0.026766 0.013576
8 0.661003 0.027723 0.842535 0.880486 30 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 24 0.815350 0.899454 0.833449 0.873550 0.878806 0.868453 0.019471 0.001195 0.026690 0.013573
9 0.641085 0.039804 0.842594 0.880467 30 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 30} 23 0.815679 0.899428 0.833260 0.873542 0.878842 0.868430 0.063540 0.009893 0.026617 0.013569
10 0.802688 0.031047 0.869691 0.902631 40 0 {u'ridge__alpha': 0.0, u'pca__n_components': 40} 1 0.843622 0.908025 0.870285 0.904571 0.895167 0.895295 0.072657 0.001759 0.021047 0.005375
11 0.829771 0.055340 0.869686 0.902606 40 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 2 0.843377 0.908012 0.870018 0.904544 0.895662 0.895261 0.108577 0.025884 0.021347 0.005383
12 0.957638 0.032783 0.869649 0.902548 40 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 40} 3 0.843146 0.907976 0.869723 0.904479 0.896079 0.895187 0.060390 0.001252 0.021610 0.005397
13 1.008830 0.035508 0.869584 0.902457 40 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 4 0.842926 0.907920 0.869398 0.904380 0.896428 0.895070 0.073110 0.004803 0.021843 0.005419
14 0.973250 0.034331 0.869490 0.902340 40 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 40} 5 0.842707 0.907848 0.869055 0.904251 0.896709 0.894919 0.162220 0.014596 0.022048 0.005448
15 1.748681 0.080916 0.865212 0.908391 50 0 {u'ridge__alpha': 0.0, u'pca__n_components': 50} 10 0.837238 0.912732 0.868442 0.909926 0.889957 0.902516 0.157382 0.019044 0.021643 0.004310
16 1.645922 0.062043 0.866387 0.908335 50 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 9 0.838522 0.912688 0.868754 0.909863 0.891885 0.902454 0.090642 0.004597 0.021849 0.004315
17 1.629989 0.073718 0.867247 0.908197 50 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 50} 8 0.839434 0.912579 0.868885 0.909721 0.893422 0.902292 0.158946 0.013320 0.022071 0.004335
18 1.993785 0.122236 0.867884 0.908001 50 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 7 0.840096 0.912423 0.868896 0.909520 0.894660 0.902060 0.229004 0.018546 0.022287 0.004365
19 1.734759 0.060368 0.868341 0.907763 50 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 50} 6 0.840548 0.912243 0.868810 0.909270 0.895665 0.901776 0.023795 0.004793 0.022504 0.004404
20 1.356025 0.062770 0.848126 0.915155 60 0 {u'ridge__alpha': 0.0, u'pca__n_components': 60} 21 0.815637 0.915902 0.843193 0.922451 0.885549 0.907111 0.127267 0.012760 0.028754 0.006285
21 1.381160 0.046568 0.853293 0.914940 60 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 18 0.821575 0.915782 0.848940 0.922053 0.889364 0.906984 0.131632 0.003410 0.027846 0.006181
22 1.287287 0.070237 0.856883 0.914417 60 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 60} 14 0.825823 0.915440 0.852728 0.921258 0.892098 0.906554 0.055355 0.018392 0.027216 0.006046
23 1.201893 0.055680 0.859231 0.913775 60 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 12 0.828601 0.915077 0.855150 0.920350 0.893941 0.905898 0.024039 0.004030 0.026831 0.005972
24 1.261091 0.074142 0.861033 0.913106 60 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 60} 11 0.830814 0.914651 0.856950 0.919408 0.895336 0.905258 0.051907 0.005175 0.026499 0.005879
25 1.259008 0.066031 0.831026 0.922127 70 0 {u'ridge__alpha': 0.0, u'pca__n_components': 70} 29 0.776720 0.920207 0.838363 0.929646 0.877995 0.916526 0.065070 0.019865 0.041670 0.005525
26 1.303955 0.063705 0.843712 0.921323 70 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 22 0.797060 0.919643 0.846848 0.928758 0.887229 0.915567 0.042969 0.014786 0.036878 0.005515
27 1.374559 0.066812 0.850322 0.920011 70 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 70} 19 0.807340 0.918779 0.851641 0.927212 0.891985 0.914042 0.211652 0.018663 0.034569 0.005447
28 1.208757 0.069166 0.854620 0.918734 70 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 16 0.814199 0.917976 0.854744 0.925642 0.894915 0.912585 0.071623 0.014817 0.032952 0.005357
29 1.369407 0.065564 0.857417 0.917537 70 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 70} 13 0.818646 0.917208 0.856830 0.924173 0.896774 0.911230 0.086379 0.012453 0.031898 0.005289
30 1.390737 0.050595 0.820348 0.928880 80 0 {u'ridge__alpha': 0.0, u'pca__n_components': 80} 30 0.769511 0.926513 0.829591 0.937553 0.861941 0.922575 0.157764 0.005718 0.038296 0.006340
31 1.721177 0.072157 0.840360 0.926630 80 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 28 0.795797 0.924502 0.842981 0.935103 0.882303 0.920286 0.017246 0.012046 0.035364 0.006234
32 1.799088 0.057840 0.848832 0.924098 80 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 80} 20 0.807510 0.922473 0.849248 0.932120 0.889739 0.917700 0.108843 0.008311 0.033571 0.005998
33 1.876051 0.080055 0.853639 0.922031 80 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 17 0.814222 0.921001 0.853039 0.929634 0.893657 0.915459 0.276300 0.004767 0.032432 0.005833
34 1.763199 0.065530 0.856805 0.920293 80 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 80} 15 0.818757 0.919744 0.855486 0.927481 0.896170 0.913654 0.205922 0.014346 0.031618 0.005658
---------- P
Cross_val_score:  [ 0.00438324 -0.30412071 -0.13891538]
Explained variance score:  0.0430778454947
Mean absolute error:  0.420143008388
Mean squared error:  1.29003209448
R2 score:  0.0420195514128
mean_fit_time mean_score_time mean_test_score mean_train_score param_pca__n_components param_ridge__alpha params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.940050 0.032695 0.059802 0.126268 20 0 {u'ridge__alpha': 0.0, u'pca__n_components': 20} 5 0.060626 0.151389 0.014544 0.122297 0.104236 0.105118 0.103132 0.005347 0.036621 0.019098
1 1.026579 0.037253 0.059835 0.126268 20 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 4 0.060637 0.151389 0.014609 0.122297 0.104258 0.105118 0.081426 0.004822 0.036603 0.019098
2 0.816109 0.034494 0.059867 0.126268 20 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 20} 3 0.060648 0.151389 0.014674 0.122296 0.104279 0.105118 0.108207 0.006729 0.036585 0.019098
3 0.769226 0.048212 0.059900 0.126267 20 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 2 0.060659 0.151388 0.014740 0.122296 0.104301 0.105118 0.120655 0.013759 0.036567 0.019097
4 0.734701 0.034125 0.059932 0.126267 20 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 20} 1 0.060670 0.151387 0.014804 0.122296 0.104323 0.105118 0.015650 0.004821 0.036549 0.019097
5 0.826794 0.042144 0.033974 0.163217 30 0 {u'ridge__alpha': 0.0, u'pca__n_components': 30} 16 0.077780 0.174519 -0.067169 0.173207 0.091309 0.141926 0.048337 0.016881 0.071731 0.015065
6 0.863384 0.046590 0.034901 0.163213 30 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 15 0.077686 0.174518 -0.064941 0.173200 0.091959 0.141923 0.040570 0.018586 0.070839 0.015064
7 0.854107 0.047770 0.035791 0.163205 30 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 30} 14 0.077592 0.174513 -0.062807 0.173189 0.092588 0.141915 0.103367 0.007617 0.069988 0.015065
8 0.883384 0.037655 0.036655 0.163191 30 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 13 0.077500 0.174507 -0.060733 0.173166 0.093197 0.141901 0.063532 0.004895 0.069161 0.015065
9 0.862960 0.119266 0.037492 0.163172 30 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 30} 12 0.077409 0.174497 -0.058722 0.173135 0.093788 0.141882 0.099215 0.125082 0.068361 0.015064
10 1.789298 0.046823 0.046766 0.218868 40 0 {u'ridge__alpha': 0.0, u'pca__n_components': 40} 10 0.102573 0.235445 -0.072094 0.230836 0.109818 0.190322 0.074392 0.010013 0.084099 0.020273
11 1.692598 0.052204 0.050600 0.218739 40 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 9 0.102611 0.235353 -0.062325 0.230578 0.111513 0.190284 0.063198 0.010064 0.079932 0.020215
12 1.696144 0.052499 0.053908 0.218639 40 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 40} 8 0.102526 0.235218 -0.053871 0.230491 0.113070 0.190209 0.312770 0.012363 0.076333 0.020196
13 1.314626 0.037559 0.056911 0.218367 40 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 7 0.102412 0.234945 -0.046194 0.230076 0.114514 0.190081 0.194818 0.007636 0.073073 0.020100
14 1.174848 0.045947 0.059566 0.218028 40 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 40} 6 0.102262 0.234580 -0.039407 0.229589 0.115843 0.189913 0.148055 0.003957 0.070204 0.019984
15 1.145359 0.042706 -0.022747 0.271030 50 0 {u'ridge__alpha': 0.0, u'pca__n_components': 50} 27 0.131288 0.277244 -0.338111 0.289536 0.138583 0.246310 0.068162 0.003710 0.223016 0.018186
16 1.425146 0.065153 -0.001446 0.270642 50 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 24 0.130557 0.276892 -0.279442 0.289008 0.144548 0.246024 0.232067 0.012412 0.196656 0.018096
17 1.448897 0.061244 0.014946 0.269684 50 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 50} 21 0.129636 0.276011 -0.233882 0.287745 0.149083 0.245297 0.273313 0.012850 0.176127 0.017897
18 1.148419 0.051772 0.027983 0.268364 50 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 17 0.128625 0.274807 -0.197258 0.285987 0.152584 0.244297 0.075692 0.002502 0.159570 0.017619
19 1.231852 0.040316 0.038428 0.266847 50 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 50} 11 0.127569 0.273416 -0.167608 0.284015 0.155324 0.243112 0.103552 0.005610 0.146130 0.017333
20 1.508210 0.060636 -0.086152 0.307794 60 0 {u'ridge__alpha': 0.0, u'pca__n_components': 60} 33 0.135286 0.299203 -0.435395 0.321852 0.041652 0.302326 0.351900 0.009171 0.249893 0.010022
21 1.278220 0.090570 -0.042552 0.306506 60 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 30 0.135409 0.298297 -0.340900 0.320835 0.077835 0.300387 0.056155 0.047434 0.212269 0.010168
22 1.019811 0.067887 -0.011325 0.303160 60 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 60} 26 0.134906 0.296279 -0.270801 0.316554 0.101920 0.296647 0.117075 0.029197 0.183971 0.009472
23 1.134464 0.054168 0.008503 0.299822 60 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 22 0.133919 0.293865 -0.225307 0.313376 0.116898 0.292226 0.144421 0.013312 0.165475 0.009607
24 1.042682 0.047178 0.024946 0.296310 60 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 60} 18 0.132922 0.291171 -0.185593 0.309681 0.127509 0.288079 0.115562 0.004579 0.148890 0.009538
25 1.578253 0.070204 -0.174827 0.339851 70 0 {u'ridge__alpha': 0.0, u'pca__n_components': 70} 34 0.147255 0.325145 -0.696960 0.365738 0.025224 0.328670 0.385025 0.009542 0.372550 0.018361
26 1.061788 0.055568 -0.081829 0.335544 70 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 32 0.146837 0.322213 -0.474995 0.359825 0.082672 0.324593 0.087758 0.018952 0.279242 0.017197
27 1.114252 0.048363 -0.033290 0.328603 70 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 70} 29 0.144614 0.317191 -0.356064 0.351060 0.111581 0.317557 0.020279 0.002673 0.228634 0.015880
28 0.885092 0.045094 -0.003050 0.321862 70 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 25 0.142436 0.312220 -0.280034 0.342614 0.128448 0.310752 0.056547 0.010622 0.195941 0.014686
29 0.913683 0.051663 0.017977 0.315698 70 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 70} 20 0.140302 0.307624 -0.225991 0.335168 0.139619 0.304302 0.046847 0.007784 0.172511 0.013834
30 1.125145 0.052959 -0.195412 0.366768 80 0 {u'ridge__alpha': 0.0, u'pca__n_components': 80} 35 0.153879 0.356326 -0.780853 0.392185 0.040737 0.351793 0.088874 0.008004 0.416538 0.018068
31 1.152800 0.044027 -0.080770 0.357187 80 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 31 0.153177 0.347351 -0.494454 0.380921 0.098966 0.343288 0.062259 0.001885 0.293354 0.016865
32 0.962932 0.051139 -0.031009 0.345050 80 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 80} 28 0.149570 0.336496 -0.365814 0.367094 0.123218 0.331561 0.056826 0.018885 0.236988 0.015717
33 1.181206 0.049515 -0.000260 0.335069 80 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 23 0.146629 0.327729 -0.285111 0.355551 0.137702 0.321926 0.233483 0.004681 0.201453 0.014676
34 1.083564 0.048439 0.020421 0.326705 80 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 80} 19 0.143928 0.320685 -0.229214 0.346037 0.146549 0.313394 0.052171 0.006003 0.176522 0.013990
---------- pH
Cross_val_score:  [ 0.75897715  0.66300039  0.70074625]
Explained variance score:  0.807972098124
Mean absolute error:  0.28537030838
Mean squared error:  0.165394995655
R2 score:  0.807073832319
mean_fit_time mean_score_time mean_test_score mean_train_score param_pca__n_components param_ridge__alpha params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.713665 0.029470 0.712278 0.738207 20 0 {u'ridge__alpha': 0.0, u'pca__n_components': 20} 35 0.713510 0.737800 0.727684 0.732513 0.695640 0.744308 0.157680 0.005492 0.013111 0.004824
1 0.686452 0.026654 0.712310 0.738207 20 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 34 0.713576 0.737800 0.727661 0.732513 0.695692 0.744308 0.072326 0.000239 0.013082 0.004824
2 0.652361 0.026166 0.712342 0.738207 20 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 20} 33 0.713642 0.737799 0.727638 0.732513 0.695745 0.744307 0.052818 0.001837 0.013053 0.004824
3 0.709737 0.027731 0.712373 0.738206 20 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 32 0.713707 0.737799 0.727616 0.732513 0.695797 0.744307 0.063859 0.004515 0.013024 0.004824
4 0.744910 0.040063 0.712404 0.738205 20 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 20} 31 0.713771 0.737798 0.727592 0.732512 0.695848 0.744306 0.109441 0.005568 0.012995 0.004824
5 0.765550 0.034906 0.760194 0.795247 30 0 {u'ridge__alpha': 0.0, u'pca__n_components': 30} 30 0.765779 0.797234 0.768420 0.789412 0.746384 0.799095 0.073118 0.003039 0.009825 0.004195
6 0.607165 0.033988 0.760290 0.795242 30 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 29 0.765897 0.797228 0.768480 0.789408 0.746492 0.799089 0.080077 0.007988 0.009813 0.004195
7 0.812174 0.036755 0.760372 0.795226 30 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 30} 28 0.766001 0.797210 0.768530 0.789395 0.746586 0.799073 0.143862 0.012694 0.009803 0.004193
8 0.675556 0.029086 0.760443 0.795201 30 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 27 0.766092 0.797180 0.768572 0.789374 0.746666 0.799048 0.045361 0.001745 0.009795 0.004190
9 0.757490 0.026361 0.760504 0.795167 30 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 30} 26 0.766173 0.797141 0.768606 0.789346 0.746732 0.799013 0.106374 0.002289 0.009789 0.004186
10 1.004326 0.031007 0.767847 0.811019 40 0 {u'ridge__alpha': 0.0, u'pca__n_components': 40} 25 0.775230 0.808767 0.782748 0.807694 0.745562 0.816596 0.155329 0.002424 0.016054 0.003968
11 0.941760 0.037015 0.768211 0.810999 40 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 24 0.775287 0.808750 0.783094 0.807672 0.746253 0.816576 0.043365 0.002735 0.015851 0.003968
12 0.957856 0.047733 0.768515 0.810945 40 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 40} 23 0.775312 0.808702 0.783371 0.807615 0.746864 0.816517 0.159096 0.026980 0.015660 0.003965
13 1.788749 0.073910 0.768766 0.810854 40 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 22 0.775297 0.808610 0.783599 0.807523 0.747401 0.816430 0.466829 0.029107 0.015483 0.003968
14 2.079384 0.085714 0.768993 0.810747 40 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 40} 21 0.775303 0.808526 0.783778 0.807407 0.747898 0.816309 0.157598 0.035132 0.015313 0.003959
15 0.922867 0.023592 0.786862 0.826263 50 0 {u'ridge__alpha': 0.0, u'pca__n_components': 50} 20 0.789849 0.824527 0.797234 0.821899 0.773502 0.832364 0.636943 0.001381 0.009916 0.004445
16 0.456726 0.027163 0.787309 0.826154 50 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 18 0.789827 0.824424 0.797770 0.821815 0.774329 0.832223 0.021459 0.007519 0.009734 0.004422
17 1.400542 0.043042 0.787462 0.825887 50 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 50} 16 0.789675 0.824174 0.798092 0.821599 0.774618 0.831889 0.447254 0.004497 0.009710 0.004372
18 1.015742 0.042938 0.787425 0.825515 50 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 17 0.789419 0.823821 0.798252 0.821291 0.774603 0.831435 0.102922 0.005460 0.009757 0.004311
19 0.967621 0.045807 0.787285 0.825080 50 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 50} 19 0.789127 0.823408 0.798296 0.820919 0.774431 0.830915 0.136923 0.002359 0.009830 0.004249
20 0.771487 0.039382 0.797997 0.840871 60 0 {u'ridge__alpha': 0.0, u'pca__n_components': 60} 15 0.803330 0.836418 0.800506 0.841087 0.790154 0.845107 0.024739 0.002715 0.005664 0.003551
21 0.813312 0.036931 0.799311 0.840409 60 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 12 0.802768 0.835949 0.804316 0.840526 0.790848 0.844753 0.075700 0.004016 0.006017 0.003595
22 0.841290 0.034837 0.799323 0.839498 60 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 60} 11 0.801857 0.835234 0.806127 0.839499 0.789985 0.843761 0.076485 0.000247 0.006829 0.003481
23 0.702233 0.039731 0.798987 0.838360 60 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 13 0.800898 0.834229 0.806994 0.838161 0.789069 0.842688 0.008206 0.000754 0.007442 0.003456
24 0.825365 0.041508 0.798262 0.837080 60 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 60} 14 0.799701 0.833101 0.807064 0.836590 0.788021 0.841550 0.074353 0.006543 0.007841 0.003467
25 0.853566 0.044855 0.801879 0.849027 70 0 {u'ridge__alpha': 0.0, u'pca__n_components': 70} 10 0.804726 0.845967 0.807726 0.851594 0.793184 0.849520 0.090441 0.004537 0.006269 0.002324
26 0.827117 0.044277 0.804317 0.847934 70 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 6 0.807026 0.844838 0.811783 0.850184 0.794141 0.848780 0.121683 0.001020 0.007453 0.002263
27 0.788559 0.048669 0.804149 0.845938 70 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 70} 7 0.806773 0.842968 0.812750 0.847748 0.792923 0.847098 0.064485 0.007052 0.008304 0.002117
28 0.971023 0.045772 0.803383 0.843966 70 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 8 0.805739 0.841094 0.812781 0.845324 0.791629 0.845479 0.101355 0.004596 0.008795 0.002031
29 0.883760 0.061942 0.802451 0.842194 70 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 70} 9 0.804517 0.839323 0.812606 0.843182 0.790230 0.844078 0.083927 0.015905 0.009251 0.002063
30 1.198298 0.062364 0.812768 0.861001 80 0 {u'ridge__alpha': 0.0, u'pca__n_components': 80} 3 0.814959 0.862071 0.813178 0.858375 0.810165 0.862557 0.056655 0.017811 0.001978 0.001868
31 1.101930 0.065111 0.815235 0.857184 80 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 1 0.817578 0.857469 0.816133 0.855463 0.811994 0.858620 0.105750 0.031556 0.002366 0.001304
32 1.234728 0.051170 0.812933 0.853045 80 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 80} 2 0.815122 0.852606 0.816209 0.851822 0.807469 0.854707 0.016557 0.006053 0.003889 0.001218
33 0.900568 0.051179 0.810631 0.849668 80 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 4 0.812482 0.848674 0.815701 0.848665 0.803710 0.851664 0.081814 0.006695 0.005068 0.001412
34 0.980790 0.046382 0.808575 0.846994 80 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 80} 5 0.810173 0.845884 0.814938 0.845937 0.800615 0.849160 0.052158 0.006193 0.005955 0.001532
---------- SOC
Cross_val_score:  [ 0.8073643   0.92340718  0.74675892]
Explained variance score:  0.879134639056
Mean absolute error:  0.242371505471
Mean squared error:  0.149922075452
R2 score:  0.878855339683
mean_fit_time mean_score_time mean_test_score mean_train_score param_pca__n_components param_ridge__alpha params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.720010 0.043975 0.822420 0.831415 20 0 {u'ridge__alpha': 0.0, u'pca__n_components': 20} 35 0.841579 0.814132 0.830051 0.832094 0.795631 0.848020 0.081377 0.010405 0.019519 0.013843
1 0.754234 0.035073 0.822437 0.831415 20 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 34 0.841681 0.814132 0.830036 0.832094 0.795595 0.848019 0.086142 0.002094 0.019566 0.013843
2 0.842878 0.035474 0.822454 0.831414 20 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 20} 33 0.841782 0.814132 0.830020 0.832093 0.795559 0.848019 0.095233 0.005932 0.019614 0.013843
3 0.951444 0.054223 0.822470 0.831413 20 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 32 0.841882 0.814131 0.830003 0.832092 0.795523 0.848017 0.074634 0.022568 0.019662 0.013842
4 0.794028 0.030531 0.822485 0.831412 20 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 20} 31 0.841982 0.814129 0.829987 0.832090 0.795487 0.848016 0.151801 0.010394 0.019709 0.013843
5 0.637704 0.034067 0.845652 0.860196 30 0 {u'ridge__alpha': 0.0, u'pca__n_components': 30} 30 0.862798 0.850248 0.852721 0.857510 0.821436 0.872829 0.074021 0.005860 0.017610 0.009412
6 0.716780 0.038168 0.845721 0.860194 30 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 29 0.863038 0.850246 0.852707 0.857509 0.821418 0.872827 0.066269 0.009404 0.017695 0.009412
7 0.715066 0.031909 0.845786 0.860189 30 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 30} 28 0.863272 0.850240 0.852690 0.857504 0.821397 0.872821 0.026747 0.004299 0.017779 0.009412
8 0.714386 0.034387 0.845847 0.860180 30 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 27 0.863499 0.850232 0.852670 0.857496 0.821371 0.872812 0.060837 0.011261 0.017862 0.009412
9 0.693144 0.025986 0.845903 0.860168 30 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 30} 26 0.863719 0.850220 0.852648 0.857485 0.821343 0.872800 0.106150 0.003424 0.017945 0.009412
10 0.777560 0.026810 0.846830 0.870692 40 0 {u'ridge__alpha': 0.0, u'pca__n_components': 40} 25 0.860049 0.862838 0.857236 0.863719 0.823204 0.885519 0.265909 0.010147 0.016745 0.010491
11 0.705610 0.025472 0.847216 0.870682 40 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 24 0.860692 0.862830 0.857338 0.863711 0.823617 0.885506 0.146339 0.007694 0.016743 0.010488
12 0.503975 0.024457 0.847575 0.870641 40 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 40} 23 0.861281 0.862796 0.857433 0.863688 0.824012 0.885439 0.025989 0.004997 0.016736 0.010470
13 0.488651 0.022388 0.847861 0.870592 40 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 22 0.861828 0.862746 0.857464 0.863646 0.824291 0.885384 0.026081 0.000858 0.016762 0.010466
14 0.474493 0.020221 0.848129 0.870525 40 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 40} 21 0.862332 0.862682 0.857496 0.863601 0.824559 0.885291 0.027656 0.000386 0.016783 0.010448
15 0.517152 0.022830 0.869316 0.898641 50 0 {u'ridge__alpha': 0.0, u'pca__n_components': 50} 20 0.880325 0.888976 0.877211 0.897964 0.850413 0.908982 0.036473 0.000350 0.013427 0.008182
16 0.558969 0.022919 0.870520 0.898462 50 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 19 0.882658 0.888799 0.878451 0.897760 0.850450 0.908826 0.034531 0.002128 0.014295 0.008191
17 0.544835 0.021742 0.871194 0.898039 50 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 50} 18 0.884160 0.888369 0.879166 0.897279 0.850255 0.908469 0.097934 0.000602 0.014946 0.008224
18 0.629820 0.035139 0.871496 0.897450 50 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 17 0.885114 0.887774 0.879513 0.896622 0.849859 0.907952 0.110058 0.015988 0.015469 0.008258
19 0.634268 0.026425 0.871583 0.896770 50 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 50} 16 0.885764 0.887097 0.879603 0.895857 0.849382 0.907357 0.067275 0.004444 0.015899 0.008296
20 0.726659 0.048819 0.887639 0.919337 60 0 {u'ridge__alpha': 0.0, u'pca__n_components': 60} 15 0.885777 0.918066 0.898911 0.915443 0.878230 0.924503 0.166173 0.034378 0.008545 0.003806
21 0.606033 0.032971 0.888949 0.918724 60 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 12 0.892289 0.917396 0.898840 0.914811 0.875718 0.923966 0.097069 0.005335 0.009730 0.003853
22 0.529828 0.030807 0.889081 0.917513 60 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 60} 11 0.895621 0.915956 0.898176 0.913537 0.873446 0.923047 0.076123 0.006458 0.011105 0.004036
23 0.513408 0.029102 0.888458 0.915975 60 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 13 0.897424 0.914120 0.896882 0.911998 0.871069 0.921806 0.067744 0.006953 0.012298 0.004213
24 0.452211 0.024225 0.887799 0.914377 60 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 60} 14 0.898413 0.912205 0.895931 0.910363 0.869053 0.920562 0.067644 0.001807 0.013294 0.004438
25 0.629607 0.029721 0.899821 0.933444 70 0 {u'ridge__alpha': 0.0, u'pca__n_components': 70} 7 0.905082 0.932337 0.909827 0.929937 0.884554 0.938058 0.206116 0.006871 0.010968 0.003407
26 0.596873 0.036275 0.901127 0.931565 70 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 4 0.909663 0.930187 0.910860 0.927943 0.882858 0.936565 0.171575 0.014767 0.012927 0.003652
27 0.474385 0.026652 0.899933 0.928634 70 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 70} 6 0.910851 0.927041 0.908850 0.924786 0.880099 0.934073 0.026425 0.003983 0.014049 0.003955
28 0.634331 0.029220 0.898239 0.925712 70 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 8 0.910621 0.923731 0.906712 0.921808 0.877382 0.931597 0.152486 0.007589 0.014834 0.004235
29 0.636655 0.067608 0.896359 0.922931 70 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 70} 10 0.909980 0.920640 0.904483 0.918987 0.874615 0.929165 0.115395 0.038267 0.015539 0.004459
30 0.770515 0.027708 0.902165 0.938585 80 0 {u'ridge__alpha': 0.0, u'pca__n_components': 80} 3 0.904061 0.937167 0.909979 0.935474 0.892456 0.943113 0.165425 0.003771 0.007278 0.003276
31 0.549637 0.031082 0.904151 0.935651 80 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 1 0.910915 0.933983 0.912235 0.932479 0.889302 0.940490 0.053027 0.004293 0.010513 0.003476
32 0.501546 0.031254 0.902389 0.931711 80 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 80} 2 0.911955 0.929791 0.910239 0.928346 0.884974 0.936995 0.021205 0.004459 0.012334 0.003783
33 0.481833 0.026170 0.900217 0.928177 80 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 5 0.911682 0.925950 0.907845 0.924638 0.881124 0.933942 0.002688 0.001251 0.013592 0.004112
34 0.541023 0.032170 0.898038 0.924986 80 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 80} 9 0.910923 0.922461 0.905432 0.921334 0.877761 0.931161 0.037512 0.009404 0.014513 0.004391
---------- Sand
Cross_val_score:  [ 0.85548356  0.82412032  0.81645869]
Explained variance score:  0.901543719316
Mean absolute error:  0.228587037029
Mean squared error:  0.0960361966804
R2 score:  0.900465588752
mean_fit_time mean_score_time mean_test_score mean_train_score param_pca__n_components param_ridge__alpha params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.411805 0.017348 0.729940 0.754594 20 0 {u'ridge__alpha': 0.0, u'pca__n_components': 20} 31 0.765481 0.747800 0.735742 0.748182 0.688597 0.767799 0.113065 0.000816 0.031655 0.009339
1 0.363860 0.018793 0.729939 0.754593 20 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 32 0.765459 0.747800 0.735730 0.748181 0.688628 0.767799 0.065375 0.003456 0.031632 0.009339
2 0.430839 0.018745 0.729938 0.754593 20 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 20} 33 0.765436 0.747799 0.735718 0.748181 0.688659 0.767799 0.062937 0.001837 0.031609 0.009339
3 0.361225 0.017859 0.729936 0.754592 20 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 34 0.765413 0.747798 0.735705 0.748180 0.688689 0.767798 0.036439 0.000725 0.031587 0.009339
4 0.473450 0.016837 0.729934 0.754590 20 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 20} 35 0.765390 0.747796 0.735692 0.748178 0.688719 0.767797 0.172480 0.000689 0.031564 0.009340
5 0.374556 0.026793 0.787625 0.816354 30 0 {u'ridge__alpha': 0.0, u'pca__n_components': 30} 30 0.819840 0.804317 0.779890 0.826130 0.763145 0.818616 0.018382 0.008068 0.023783 0.009048
6 0.388613 0.017769 0.787672 0.816350 30 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 29 0.819682 0.804313 0.780292 0.826124 0.763042 0.818613 0.029263 0.000215 0.023705 0.009047
7 0.365967 0.018257 0.787710 0.816338 30 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 30} 28 0.819522 0.804302 0.780675 0.826108 0.762934 0.818605 0.031764 0.000647 0.023632 0.009045
8 0.557120 0.019635 0.787740 0.816319 30 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 27 0.819360 0.804284 0.781037 0.826081 0.762822 0.818591 0.133832 0.000459 0.023563 0.009043
9 0.371340 0.017993 0.787761 0.816293 30 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 30} 26 0.819196 0.804260 0.781380 0.826045 0.762706 0.818573 0.028460 0.000471 0.023499 0.009039
10 0.727140 0.029292 0.803770 0.841904 40 0 {u'ridge__alpha': 0.0, u'pca__n_components': 40} 25 0.829749 0.825610 0.807199 0.847921 0.774363 0.852182 0.063624 0.011768 0.022741 0.011652
11 0.459065 0.019679 0.804122 0.841870 40 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 24 0.829887 0.825590 0.807312 0.847867 0.775166 0.852154 0.033131 0.000828 0.022453 0.011644
12 0.600410 0.055568 0.804423 0.841801 40 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 40} 23 0.829978 0.825529 0.807426 0.847796 0.775865 0.852078 0.187695 0.031229 0.022193 0.011638
13 0.449386 0.020466 0.804647 0.841688 40 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 22 0.830019 0.825447 0.807462 0.847656 0.776459 0.851962 0.011631 0.001581 0.021956 0.011618
14 0.728989 0.031170 0.804827 0.841536 40 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 40} 21 0.830023 0.825307 0.807489 0.847489 0.776970 0.851812 0.240229 0.016664 0.021740 0.011611
15 0.763992 0.022717 0.825048 0.864462 50 0 {u'ridge__alpha': 0.0, u'pca__n_components': 50} 20 0.861719 0.850457 0.811890 0.872281 0.801536 0.870647 0.246416 0.002464 0.026273 0.009926
16 0.574064 0.026674 0.825820 0.864319 50 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 19 0.861404 0.850317 0.813516 0.872124 0.802542 0.870514 0.063027 0.007199 0.025557 0.009922
17 0.489928 0.023461 0.826211 0.863954 50 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 50} 18 0.860902 0.849965 0.814652 0.871727 0.803079 0.870170 0.007729 0.000392 0.024981 0.009912
18 0.727578 0.028882 0.826332 0.863444 50 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 16 0.860272 0.849468 0.815428 0.871171 0.803296 0.869695 0.219359 0.005668 0.024505 0.009901
19 0.703349 0.026372 0.826274 0.862845 50 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 50} 17 0.859572 0.848885 0.815955 0.870515 0.803294 0.869135 0.092791 0.008470 0.024106 0.009887
20 0.419603 0.023002 0.837390 0.880496 60 0 {u'ridge__alpha': 0.0, u'pca__n_components': 60} 15 0.872584 0.866806 0.825317 0.883512 0.814269 0.891169 0.043493 0.001872 0.025291 0.010172
21 0.490565 0.022357 0.839221 0.879997 60 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 13 0.873069 0.866389 0.826734 0.883105 0.817860 0.890498 0.077746 0.000681 0.024207 0.010085
22 0.517476 0.022597 0.839602 0.878848 60 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 60} 11 0.872347 0.865282 0.827196 0.882175 0.819264 0.889086 0.082241 0.000298 0.023379 0.009999
23 0.508410 0.026169 0.839298 0.877397 60 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 12 0.871334 0.863749 0.827265 0.881028 0.819296 0.887413 0.104365 0.004689 0.022885 0.009996
24 0.399664 0.022034 0.838716 0.875930 60 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 60} 14 0.869901 0.862206 0.827254 0.879830 0.818993 0.885754 0.036444 0.000767 0.022308 0.010001
25 0.480760 0.036084 0.843537 0.885899 70 0 {u'ridge__alpha': 0.0, u'pca__n_components': 70} 9 0.870983 0.873483 0.839386 0.889592 0.820242 0.894623 0.056159 0.010050 0.020922 0.009017
26 0.458405 0.029238 0.845261 0.884768 70 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 6 0.873326 0.872169 0.839084 0.888455 0.823372 0.893680 0.029315 0.002979 0.020856 0.009160
27 0.530435 0.031760 0.845156 0.883119 70 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 70} 7 0.873216 0.870564 0.838167 0.886881 0.824085 0.891913 0.023894 0.002669 0.020657 0.009112
28 0.564947 0.027127 0.844293 0.881176 70 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 8 0.872341 0.868443 0.836750 0.885153 0.823787 0.889931 0.061462 0.002891 0.020527 0.009212
29 0.573140 0.025191 0.843178 0.879323 70 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 70} 10 0.871173 0.866511 0.835466 0.883484 0.822897 0.887976 0.250323 0.002590 0.020449 0.009243
30 0.611503 0.034657 0.847648 0.895737 80 0 {u'ridge__alpha': 0.0, u'pca__n_components': 80} 4 0.873946 0.882509 0.843304 0.900108 0.825692 0.904593 0.020035 0.012576 0.019938 0.009531
31 0.582416 0.033209 0.851389 0.892390 80 0.125 {u'ridge__alpha': 0.125, u'pca__n_components':... 1 0.878829 0.879379 0.843756 0.896560 0.831580 0.901230 0.102950 0.001071 0.020030 0.009396
32 0.621096 0.031022 0.850388 0.888804 80 0.25 {u'ridge__alpha': 0.25, u'pca__n_components': 80} 2 0.878396 0.876065 0.841516 0.892852 0.831250 0.897495 0.109965 0.004601 0.020244 0.009205
33 0.608190 0.028120 0.848679 0.885709 80 0.375 {u'ridge__alpha': 0.375, u'pca__n_components':... 3 0.876757 0.872846 0.839453 0.889874 0.829828 0.894406 0.065336 0.001874 0.020239 0.009281
34 0.543245 0.028551 0.846869 0.883034 80 0.5 {u'ridge__alpha': 0.5, u'pca__n_components': 80} 5 0.874992 0.870105 0.837580 0.887341 0.828036 0.891657 0.005091 0.002987 0.020264 0.009311
Completed in 1043.49 sec.
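
The negative `mean_test_score` values in the P table are R² scores (the default score for regressors under GridSearchCV). A negative value means the pipeline does worse on the held-out fold than a constant prediction at the fold mean. A minimal sketch of that relationship, computed from the definition of R² on made-up numbers (not taken from this dataset):

```python
import numpy as np

def r2(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot, the same definition sklearn's r2_score uses
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up values, for illustration only
y_true = np.array([0.0, 1.0, 2.0, 3.0])
baseline = np.full(4, y_true.mean())         # always predict the mean
bad_model = np.array([3.0, 2.0, 1.0, 0.0])   # anti-correlated predictions

r2_baseline = r2(y_true, baseline)
r2_bad = r2(y_true, bad_model)
print(r2_baseline)
print(r2_bad)
```

The baseline scores exactly zero, and any model whose squared error exceeds the variance of the target scores below zero, which is the pattern seen for P at higher `n_components`.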

In [14]:
print len(y_pipelines_ridge)
print y_scores_ridge


5
[0.14903810000194131, 1.2900320944830184, 0.16539499565487109, 0.14992207545220609, 0.096036196680399283]

In [15]:
# SVR with PCA combinations

y_pipelines_svr = []
y_scores_svr = []

start = time.time()
for ind, y in enumerate(y_vars):
    
    # set up the train and test data (no random_state, so the split differs between runs and cells)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    print '\n----------', y_var_labels[ind]

    pca = PCA()
    svr = SVR()
    steps = [('pca', pca), ('svr', svr)]
    pipeline = Pipeline(steps)

    parameters = dict(pca__n_components=list(range(20, 90, 10)),
                      svr__kernel=['rbf'],
                      svr__C=[1e3])

    cv = GridSearchCV(pipeline, param_grid=parameters, verbose=0)
    cv.fit(X_train, y_train)   

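    # NOTE: passing the GridSearchCV object re-runs the whole grid search on folds of the
    # test split, so these scores do not describe the model fitted on X_train above
    print 'Cross_val_score: ', cross_val_score(cv, X_test, y_test)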
    
    y_predictions = cv.predict(X_test)
    mse = mean_squared_error(y_test, y_predictions)
    print 'Explained variance score: ', explained_variance_score(y_test, y_predictions)
    print 'Mean absolute error: ', mean_absolute_error(y_test, y_predictions)
    print 'Mean squared error: ', mse
    print 'R2 score: ', r2_score(y_test, y_predictions)
    
    display(pd.DataFrame.from_dict(cv.cv_results_))
    
    # capture the best pipeline estimator and mse value
    y_pipelines_svr.append(cv.best_estimator_)
    y_scores_svr.append(mse)

print '\nCompleted in %0.2f sec.' % (time.time()-start)


---------- Ca
Cross_val_score:  [ 0.84166903  0.62536472  0.89655656]
Explained variance score:  0.85260545565
Mean absolute error:  0.178756573367
Mean squared error:  0.141106541199
R2 score:  0.852522246105
mean_fit_time mean_score_time mean_test_score mean_train_score param_pca__n_components param_svr__C param_svr__kernel params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.461776 0.027719 0.824360 0.993966 20 1000 rbf {u'pca__n_components': 20, u'svr__kernel': u'r... 7 0.867014 0.994668 0.761285 0.992048 0.844781 0.995181 0.089763 0.004054 0.045515 0.001372
1 0.374501 0.027137 0.859268 0.994306 30 1000 rbf {u'pca__n_components': 30, u'svr__kernel': u'r... 5 0.891185 0.994681 0.819295 0.992727 0.867326 0.995509 0.009265 0.000292 0.029897 0.001166
2 0.482390 0.029775 0.867047 0.994454 40 1000 rbf {u'pca__n_components': 40, u'svr__kernel': u'r... 2 0.891914 0.994817 0.850231 0.992930 0.858997 0.995616 0.026068 0.001527 0.017944 0.001126
3 0.522716 0.031777 0.869248 0.994587 50 1000 rbf {u'pca__n_components': 50, u'svr__kernel': u'r... 1 0.889040 0.994996 0.871182 0.993042 0.847521 0.995722 0.031689 0.000342 0.017005 0.001132
4 0.444079 0.039923 0.866101 0.994624 60 1000 rbf {u'pca__n_components': 60, u'svr__kernel': u'r... 3 0.884664 0.995189 0.886004 0.993015 0.827634 0.995667 0.005919 0.002609 0.027205 0.001154
5 0.629485 0.047098 0.860802 0.994623 70 1000 rbf {u'pca__n_components': 70, u'svr__kernel': u'r... 4 0.879739 0.995319 0.896637 0.992929 0.806029 0.995621 0.105358 0.006839 0.039340 0.001204
6 0.611390 0.053320 0.854485 0.994536 80 1000 rbf {u'pca__n_components': 80, u'svr__kernel': u'r... 6 0.874961 0.995363 0.904068 0.992772 0.784427 0.995475 0.057646 0.008095 0.050944 0.001249
---------- P
Cross_val_score:  [-0.02119075 -0.4471562  -0.13813815]
Explained variance score:  0.160692111036
Mean absolute error:  0.399555862086
Mean squared error:  0.832064936251
R2 score:  0.160636328534
mean_fit_time mean_score_time mean_test_score mean_train_score param_pca__n_components param_svr__C param_svr__kernel params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.521421 0.041120 0.015998 0.990803 20 1000 rbf {u'pca__n_components': 20, u'svr__kernel': u'r... 1 0.291230 0.986340 -0.160150 0.994045 -0.083088 0.992025 0.121289 0.016376 0.197145 0.003262
1 0.542441 0.035285 -0.125932 0.990643 30 1000 rbf {u'pca__n_components': 30, u'svr__kernel': u'r... 2 0.294767 0.986196 -0.572168 0.993652 -0.100394 0.992080 0.035109 0.004074 0.354385 0.003209
2 0.651699 0.037995 -0.330711 0.990484 40 1000 rbf {u'pca__n_components': 40, u'svr__kernel': u'r... 3 0.292329 0.985965 -1.123363 0.993558 -0.161099 0.991931 0.045386 0.003607 0.590267 0.003264
3 0.769732 0.042690 -0.582121 0.990284 50 1000 rbf {u'pca__n_components': 50, u'svr__kernel': u'r... 4 0.289955 0.985527 -1.813362 0.993419 -0.222957 0.991906 0.068993 0.002175 0.895446 0.003420
4 0.747739 0.051421 -0.847739 0.990197 60 1000 rbf {u'pca__n_components': 60, u'svr__kernel': u'r... 5 0.286070 0.985340 -2.544379 0.993392 -0.284910 0.991860 0.017656 0.005152 1.222141 0.003491
5 0.833574 0.056736 -1.088068 0.990180 70 1000 rbf {u'pca__n_components': 70, u'svr__kernel': u'r... 6 0.283503 0.985345 -3.212173 0.993346 -0.335535 0.991849 0.074633 0.003644 1.523082 0.003473
6 0.958503 0.063350 -1.304085 0.990134 80 1000 rbf {u'pca__n_components': 80, u'svr__kernel': u'r... 7 0.279366 0.985390 -3.807693 0.993229 -0.383928 0.991784 0.055273 0.006642 1.790908 0.003406
---------- pH
Cross_val_score:  [ 0.49146396  0.60775688  0.53948277]
Explained variance score:  0.759654115616
Mean absolute error:  0.305378698443
Mean squared error:  0.18574743249
R2 score:  0.759386593196
mean_fit_time mean_score_time mean_test_score mean_train_score param_pca__n_components param_svr__C param_svr__kernel params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.864063 0.059998 0.693277 0.990107 20 1000 rbf {u'pca__n_components': 20, u'svr__kernel': u'r... 7 0.650987 0.990050 0.716984 0.990084 0.711860 0.990186 0.053757 0.036861 0.029977 0.000058
1 0.622419 0.056236 0.727309 0.990266 30 1000 rbf {u'pca__n_components': 30, u'svr__kernel': u'r... 6 0.684394 0.990273 0.759621 0.990232 0.737912 0.990294 0.201819 0.013072 0.031613 0.000026
2 0.630808 0.035387 0.733667 0.990279 40 1000 rbf {u'pca__n_components': 40, u'svr__kernel': u'r... 5 0.691438 0.990287 0.771125 0.990164 0.738439 0.990386 0.092006 0.001577 0.032706 0.000091
3 0.654573 0.038820 0.737350 0.990309 50 1000 rbf {u'pca__n_components': 50, u'svr__kernel': u'r... 4 0.699829 0.990435 0.775576 0.990112 0.736647 0.990381 0.073229 0.000495 0.030927 0.000141
4 0.618227 0.042926 0.739657 0.990281 60 1000 rbf {u'pca__n_components': 60, u'svr__kernel': u'r... 3 0.707220 0.990498 0.778187 0.990024 0.733564 0.990322 0.022088 0.000558 0.029291 0.000196
5 0.760805 0.053096 0.739911 0.990297 70 1000 rbf {u'pca__n_components': 70, u'svr__kernel': u'r... 2 0.709117 0.990609 0.779580 0.989901 0.731038 0.990381 0.111644 0.004063 0.029443 0.000295
6 0.807621 0.062126 0.740256 0.990271 80 1000 rbf {u'pca__n_components': 80, u'svr__kernel': u'r... 1 0.709871 0.990482 0.780707 0.989828 0.730191 0.990503 0.020965 0.010170 0.029782 0.000313
---------- SOC
Cross_val_score:  [ 0.74572577  0.74470343  0.81175305]
Explained variance score:  0.919766823404
Mean absolute error:  0.190788343947
Mean squared error:  0.0906953863834
R2 score:  0.919719121009
mean_fit_time mean_score_time mean_test_score mean_train_score param_pca__n_components param_svr__C param_svr__kernel params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.502560 0.025971 0.819064 0.994234 20 1000 rbf {u'pca__n_components': 20, u'svr__kernel': u'r... 7 0.813139 0.993500 0.826827 0.994480 0.817226 0.994723 0.088705 0.001216 0.005738 0.000529
1 0.393891 0.031783 0.852599 0.994341 30 1000 rbf {u'pca__n_components': 30, u'svr__kernel': u'r... 6 0.854940 0.993663 0.863074 0.994601 0.839782 0.994759 0.005592 0.003410 0.009652 0.000484
2 0.547718 0.035388 0.869369 0.994481 40 1000 rbf {u'pca__n_components': 40, u'svr__kernel': u'r... 5 0.874086 0.993785 0.884414 0.994756 0.849608 0.994902 0.042738 0.003631 0.014596 0.000496
3 0.640083 0.045014 0.877214 0.994547 50 1000 rbf {u'pca__n_components': 50, u'svr__kernel': u'r... 4 0.881876 0.993897 0.896406 0.994741 0.853359 0.995003 0.066917 0.011442 0.017880 0.000472
4 0.561799 0.040721 0.879757 0.994608 60 1000 rbf {u'pca__n_components': 60, u'svr__kernel': u'r... 3 0.885022 0.994022 0.901702 0.994813 0.852546 0.994987 0.063221 0.002208 0.020410 0.000420
5 0.684491 0.048355 0.880639 0.994614 70 1000 rbf {u'pca__n_components': 70, u'svr__kernel': u'r... 1 0.886210 0.994051 0.905302 0.994785 0.850404 0.995007 0.048606 0.003646 0.022756 0.000409
6 0.735073 0.054397 0.880629 0.994629 80 1000 rbf {u'pca__n_components': 80, u'svr__kernel': u'r... 2 0.885768 0.994110 0.908235 0.994831 0.847883 0.994944 0.027493 0.004412 0.024905 0.000369
---------- Sand
Cross_val_score:  [ 0.74196547  0.72461023  0.77766828]
Explained variance score:  0.812795803822
Mean absolute error:  0.296725922719
Mean squared error:  0.182801681052
R2 score:  0.805827145753
mean_fit_time mean_score_time mean_test_score mean_train_score param_pca__n_components param_svr__C param_svr__kernel params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.638654 0.037018 0.781088 0.991369 20 1000 rbf {u'pca__n_components': 20, u'svr__kernel': u'r... 7 0.755036 0.991430 0.768031 0.991285 0.820198 0.991391 0.126758 0.002655 0.028159 0.000061
1 0.543212 0.036409 0.800760 0.991623 30 1000 rbf {u'pca__n_components': 30, u'svr__kernel': u'r... 6 0.786238 0.991839 0.782912 0.991246 0.833131 0.991784 0.081966 0.007712 0.022930 0.000268
2 0.766771 0.037592 0.809195 0.991631 40 1000 rbf {u'pca__n_components': 40, u'svr__kernel': u'r... 1 0.801452 0.991858 0.787881 0.991362 0.838252 0.991673 0.072058 0.005163 0.021280 0.000205
3 0.663927 0.059354 0.809106 0.991618 50 1000 rbf {u'pca__n_components': 50, u'svr__kernel': u'r... 2 0.806854 0.991918 0.781487 0.991306 0.838978 0.991630 0.079458 0.016503 0.023525 0.000250
4 0.670192 0.055156 0.808266 0.991540 60 1000 rbf {u'pca__n_components': 60, u'svr__kernel': u'r... 4 0.809043 0.991783 0.775542 0.991321 0.840214 0.991515 0.008540 0.016782 0.026408 0.000189
5 0.766627 0.056648 0.806201 0.991508 70 1000 rbf {u'pca__n_components': 70, u'svr__kernel': u'r... 5 0.809920 0.991755 0.768105 0.991255 0.840577 0.991514 0.101465 0.007482 0.029703 0.000204
6 1.007413 0.082079 0.808267 0.991526 80 1000 rbf {u'pca__n_components': 80, u'svr__kernel': u'r... 3 0.810324 0.991812 0.774154 0.991238 0.840323 0.991529 0.076931 0.026966 0.027052 0.000234
Completed in 153.84 sec.
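
The four metrics printed for each target above can be reproduced by hand. The following is a minimal sketch with NumPy; `regression_report` is a hypothetical helper name and the arrays are made-up values, not the notebook's data. It also illustrates why the explained variance score and the R2 score can differ: explained variance ignores a constant offset in the residuals, while R2 penalizes it.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return the four regression metrics printed for each target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "explained_variance": 1.0 - np.var(resid) / np.var(y_true),
        "mae": np.mean(np.abs(resid)),
        "mse": np.mean(resid ** 2),
        "r2": 1.0 - np.sum(resid ** 2) / ss_tot,
    }

# Illustrative values only: predictions with a constant bias of +0.5.
y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_pred = y_true + 0.5
report = regression_report(y_true, y_pred)
```

With a purely constant bias the residual variance is zero, so the explained variance is 1.0 while R2 drops to 0.8.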

In [16]:
print len(y_pipelines_svr)
print y_scores_svr


5
[0.14110654119878596, 0.83206493625090205, 0.18574743249018208, 0.090695386383408447, 0.18280168105219741]

In [17]:
# Pick the best-performing pipeline for each target variable, based on MSE

# combine results lists from modeling cells
y_vars_pipelines = [y_pipelines_lin, y_pipelines_linsel, y_pipelines_ridge, y_pipelines_svr]
y_vars_scores = [y_scores_lin, y_scores_linsel, y_scores_ridge, y_scores_svr]

pipelines = np.array(y_vars_pipelines)
scores = np.array(y_vars_scores)

pipeline_winners = []

print pipelines.shape

# Note: P has the highest MSE of all targets for every model
print scores

for ind, y in enumerate(y_vars):
    # get index of best score
    best_ind = np.argmin(scores[:,ind])
    print(best_ind, ind)
    # capture the pipeline
    pipeline_winners.append(pipelines[best_ind, ind])
    
print len(pipeline_winners)


(4, 5)
[[ 0.09961655  1.47693037  0.14844391  0.16821465  0.12908094]
 [ 0.26768592  0.46275798  0.21075622  0.16946415  0.11478689]
 [ 0.1490381   1.29003209  0.165395    0.14992208  0.0960362 ]
 [ 0.14110654  0.83206494  0.18574743  0.09069539  0.18280168]]
(0, 0)
(1, 1)
(0, 2)
(3, 3)
(2, 4)
5
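
The per-target selection above can also be done in one vectorized call. This is a small sketch that reuses the score matrix exactly as printed above (rows follow the order of `y_vars_scores`: lin, linsel, ridge, svr); the names `best_model_per_target` and `best_mse_per_target` are introduced here for illustration.

```python
import numpy as np

# MSE per model (rows) and target (columns), copied from the printed output.
scores = np.array([
    [0.09961655, 1.47693037, 0.14844391, 0.16821465, 0.12908094],
    [0.26768592, 0.46275798, 0.21075622, 0.16946415, 0.11478689],
    [0.1490381,  1.29003209, 0.165395,   0.14992208, 0.0960362],
    [0.14110654, 0.83206494, 0.18574743, 0.09069539, 0.18280168],
])

# argmin along axis 0 returns, for each column, the row with the lowest MSE.
best_model_per_target = np.argmin(scores, axis=0)
best_mse_per_target = scores.min(axis=0)
print(best_model_per_target.tolist())  # [0, 1, 0, 3, 2]
```

The result agrees with the `(best_ind, ind)` pairs printed by the loop.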

In [18]:
# Iterate through test samples

allPredictions = []
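# NOTE: this overrides the per-target winners selected in the previous cell
# and uses the y_pipelines_lin pipelines for all five targets.
pipeline_winners = y_pipelines_lin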

for s_ind in range(len(test_x)):
    
    sampleId = test_ids[s_ind]
    sample = test_x[s_ind]
    
    currentSamplePredictions = []
    
    # Use the winning model to estimate the outcome variables
    for ind in range(0, 5):      
        pred = pipeline_winners[ind].predict(sample.reshape(1,-1))[0]     
        currentSamplePredictions.append(pred)
    
    allPredictions.append(currentSamplePredictions) 
    #print len(allPredictions)
    
#print allPredictions
print 'Predictions calculated.'


Predictions calculated.
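
Because scikit-learn's `predict` accepts a 2-D array of many samples, the per-sample loop can likely be replaced by one call per target. The sketch below uses a hypothetical stand-in class (`ToyModel`) and made-up data so that it runs without the trained pipelines; with the real objects, the same pattern would apply to `pipeline_winners` and `test_x`.

```python
import numpy as np

class ToyModel(object):
    """Hypothetical stand-in for a fitted pipeline: exposes predict(X)."""
    def __init__(self, weights):
        self.weights = np.asarray(weights, dtype=float)

    def predict(self, X):
        return np.asarray(X).dot(self.weights)

rng = np.random.RandomState(0)
test_x_demo = rng.rand(4, 3)                        # 4 samples, 3 features (made up)
models = [ToyModel(rng.rand(3)) for _ in range(5)]  # one model per target

# One predict call per target; columns follow the order of `models`.
batch_predictions = np.column_stack([m.predict(test_x_demo) for m in models])

# Same result as predicting one sample at a time.
loop_predictions = np.array([
    [m.predict(row.reshape(1, -1))[0] for m in models] for row in test_x_demo
])
```

Both arrays have one row per sample and one column per target, and they hold the same values.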

In [21]:
# Generate csv for AfricaSoil Kaggle

filename = 'jsccjc_20170423_2.csv'

# Opening with 'w' truncates any existing file; the with-block closes it on exit
with open(filename, 'w') as f:
    f.write('PIDN,Ca,P,pH,SOC,Sand\n')  # python will convert \n to os.linesep

    # One row per test sample: the sample ID followed by the five predictions
    for testId, pred in zip(test_ids, allPredictions):
        f.write(testId + ',' + ','.join(str(p) for p in pred) + '\n')
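
An alternative to building each row by string concatenation is the standard-library `csv` module, which handles delimiters and quoting. This is a sketch assuming Python 3 (the notebook itself runs on Python 2); `write_submission` is a hypothetical helper, and the IDs and values are made up.

```python
import csv
import io

def write_submission(handle, ids, predictions):
    """Write one row per sample: the ID followed by the five predictions."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["PIDN", "Ca", "P", "pH", "SOC", "Sand"])
    for sample_id, pred in zip(ids, predictions):
        writer.writerow([sample_id] + list(pred))

# Made-up IDs and values, written to an in-memory buffer for illustration.
buffer = io.StringIO()
write_submission(buffer,
                 ["id_a", "id_b"],
                 [[0.1, 0.2, 0.3, 0.4, 0.5], [1.0, 1.1, 1.2, 1.3, 1.4]])
csv_text = buffer.getvalue()
```

To write a real file under Python 3, the handle would come from `open(filename, 'w', newline='')` instead of `io.StringIO()`.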

In [20]:
# Check where jupyter may drop the csv if it can't be found where expected

import os

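# __file__ is not defined inside a notebook, so report the working directory,
# which is where files opened with a relative path are written
fileDir = os.getcwd()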
print fileDir


/Users/jcasper/Documents/Education/UCBerkeley-DS/2017_Spring/DATASCIW207_ML/kaggle_africa_soil/notebook

NOTES

Below is a series of notes collected over the course of this final project. They are kept here for reference.

Options for high-dimensional data with many features and few observations: choose random sets of variables and assess their importance using cross-validation; use ridge regression, the lasso, or elastic net for regularization (introducing additional information in order to solve an ill-posed problem or to prevent overfitting); or choose a technique, such as a support vector machine or random forest, that deals well with a large number of predictors. Refer to the reference list for the original source.
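
The options above can be compared with cross-validation. The following is a rough sketch on synthetic data, not the soil data: the sizes (60 observations, 300 features), the coefficients, and the hyperparameter values are all arbitrary assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge, ElasticNet
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

# Synthetic "wide" data: 60 observations, 300 features, 5 of them informative.
rng = np.random.RandomState(0)
X = rng.randn(60, 300)
y = X[:, :5].dot([3.0, -2.0, 1.5, 1.0, -1.0]) + 0.1 * rng.randn(60)

candidates = {
    "ridge": Ridge(alpha=1.0),
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000),
    "svr_linear": SVR(kernel="linear", C=1.0),
}

# Mean 3-fold cross-validated R^2 (the default score for regressors).
cv_r2 = {name: cross_val_score(model, X, y, cv=3).mean()
         for name, model in candidates.items()}
```

Which candidate scores best depends on the data and on the hyperparameters, so the values in `cv_r2` are a starting point for tuning rather than a verdict.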

LASSO (least absolute shrinkage and selection operator) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces.
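
The variable-selection effect can be seen by counting non-zero coefficients. This is a minimal sketch on synthetic data (the sizes, coefficients, and `alpha` values are assumptions for illustration): the lasso sets most coefficients exactly to zero, whereas ridge regression only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 50 observations, 200 features, only the first 5 matter.
rng = np.random.RandomState(0)
X = rng.randn(50, 200)
true_coef = np.zeros(200)
true_coef[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X.dot(true_coef) + 0.1 * rng.randn(50)

lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count the coefficients each model keeps away from exactly zero.
n_nonzero_lasso = int(np.sum(lasso.coef_ != 0))
n_nonzero_ridge = int(np.sum(ridge.coef_ != 0))
```

Comparing `n_nonzero_lasso` with `n_nonzero_ridge` shows how much sparser the lasso solution is.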

When choosing an ML method, consider:

  • the size of the training data
  • the number of features
  • the quality of features
  • the number of unique class labels
  • linear vs. non-linear problems

Always start simple: the first algorithms to try are naive Bayes, logistic regression, k-nearest neighbours (starting with one neighbour), and Fisher's linear discriminant, before anything else. Among more advanced methods, ensemble methods tend to produce the best results, as shown by the winners of Kaggle competitions, where XGBoost has been very popular. Neural networks may be useful for predicting values, but the number of observations here is low. Refer to the reference list for the original source.

Subject: soil quality for agriculture. Predictor variables: 3593 features (see feature_names). Response variables: 'Ca', 'P', 'pH', 'SOC', 'Sand'.

A continuous predictor variable is sometimes called a covariate, and a categorical predictor variable is sometimes called a factor. For example, in a cake experiment a covariate could be the oven temperature and a factor could be the oven used. Usually, you plot predictor variables on the x-axis and response variables on the y-axis. Refer to the reference list for the original source.

For continuous variables such as income, it is customary to apply a log transformation to bring the distribution as close to normal as possible. You can then fit OLS and run some diagnostics to check the model fit. For other types of continuous variables, plot a histogram and check the distribution. If it is roughly normal, you can fit OLS and check the diagnostics and model fit. Refer to the reference list for the original source.
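
The effect of a log transformation can be checked numerically before fitting OLS. This is a small sketch using synthetic, income-like data drawn from a log-normal distribution (an assumption for illustration); `skewness` is a hypothetical helper that computes the third standardized moment.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the third standardized moment."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

# Synthetic, right-skewed "income" values.
rng = np.random.RandomState(0)
income = rng.lognormal(mean=10.0, sigma=1.0, size=5000)
log_income = np.log(income)

skew_before = skewness(income)      # strongly positive (long right tail)
skew_after = skewness(log_income)   # close to zero (roughly symmetric)
```

A skewness near zero after the transformation suggests the variable is closer to normal; a histogram remains the more informative check.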

General References:

SKLearn References: